Open carlosparadis opened 3 months ago
I'll start by just posting what the current storage organization is:
../../rawdata/issue_tracker/geronimo/issue_comments/
../../rawdata/issue_tracker/geronimo/issues/
Changed from
../../rawdata/issue_tracker/
I just made new directories for the project name, and for issues or issue_comments respectively.
Unchanged:
../../rawdata/github/kaiaulu/issue/
../../rawdata/github/kaiaulu/pull_request/
../../rawdata/github/kaiaulu/issue_or_pr_comment/
../../rawdata/github/kaiaulu/commit/
Currently, the bugzilla_showcase notebook uses 3 different methods to download data: Traditional Perceval, Perceval's REST API, and Bugzilla's REST API. The one that I will be using for the refresher is Bugzilla's REST API.
I will be using the current storage organization used for bugzilla issues as it is the same format as the GitHub version above.
../../rawdata/bugzilla/redhat/issues
../../rawdata/bugzilla/redhat/issues_comments
@Ssunoo2 there is something wrong with your file path. I remember we agreed to include `jira` in the path, for consistency with `bugzilla`. In that sense, your `issue_tracker` folder should be called `jira` instead, since both of them are issue trackers.
In addition to that, and the primary reason why I wanted to create this issue to compare side by side, is that the project organization is counter-intuitive as it is on Kaiaulu (and I believe there was even some confusion in your group early on about why the files were organized in this manner).
We should organize the information at project level, i.e.:
Instead of:
../../rawdata/bugzilla/redhat/issues
../../rawdata/bugzilla/redhat/issues_comments
We would have:
../../rawdata/redhat/bugzilla/issues
../../rawdata/redhat/bugzilla/issues_comments
And instead of:
../../rawdata/jira/geronimo/issue_comments/
../../rawdata/jira/geronimo/issues/
We would have:
../../rawdata/geronimo/jira/issue_comments/
../../rawdata/geronimo/jira/issues/
The reason is that someone running a multi-project analysis generally thinks of the data "per project" rather than "per data source". In addition, if we are discussing a particular project and I would like to reproduce your analysis, I may need to ask you to send me "the data of the project". In the current organization, you would need to check every data source folder to fish for the data the project has, whereas in the new organization you simply zip the folder with said project name and send it over. Lastly, in the project folder organization, you can very quickly assess what data you have by opening the project folder. As it is, you also have to go check each data source folder.
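A minimal Python sketch of the project-level layout argued for above. The helper names (`endpoint_dir`, `list_sources`, `zip_project`) are hypothetical and not part of Kaiaulu; the sketch only illustrates that, with `<rawdata>/<project>/<source>/<endpoint>`, listing or sharing a project's data touches a single folder.

```python
import shutil
import tempfile
from pathlib import Path


def endpoint_dir(rawdata: Path, project: str, source: str, endpoint: str) -> Path:
    """Project-level layout: <rawdata>/<project>/<source>/<endpoint>."""
    return rawdata / project / source / endpoint


def list_sources(rawdata: Path, project: str) -> list:
    """Opening the project folder is enough to see which sources exist."""
    return sorted(p.name for p in (rawdata / project).iterdir() if p.is_dir())


def zip_project(rawdata: Path, project: str, out_dir: Path) -> Path:
    """Bundle all data of one project into <out_dir>/<project>.zip."""
    archive = shutil.make_archive(
        str(out_dir / project), "zip", root_dir=rawdata, base_dir=project
    )
    return Path(archive)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rawdata = Path(tmp) / "rawdata"
        endpoint_dir(rawdata, "geronimo", "jira", "issues").mkdir(parents=True)
        endpoint_dir(rawdata, "geronimo", "jira", "issue_comments").mkdir(parents=True)
        print(list_sources(rawdata, "geronimo"))
        print(zip_project(rawdata, "geronimo", Path(tmp)).name)
```

With the per-data-source layout, the equivalent of `zip_project` would have to visit every source folder and pick out the project's sub-folder from each.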
There are some strange cases out there, that I want to make sure you give proper consideration as you write the refresher of these downloaders. The first one is the HADOOP project. @Ssunoo2 this affects you the most since this is a JIRA project.
If you look at Hadoop on GitHub (https://github.com/apache/hadoop), particularly the commits, you will see it can have multiple JIRA IDs. You can imagine the mess it turned out to be trying to manage that in the current folder organization.
Let's assume the proposed new organization with the downloader logic you currently have for JIRA. I will focus on the issue folder since what works for issues would work for comments folder.
You would then have:
../../rawdata/hadoop/jira/issues/HDFS_...
../../rawdata/hadoop/jira/issues/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/YARN_...
All in one folder. Would your refresher function work in this case? Or would it break because it assumes all files in there are from a single issue key? If it would break, then we need to add some logic to discern files based on the issue key. That being the case, notice how much saner it is to have this data contained inside the hadoop folder.
We could technically make a sub-folder for every issue key, something like:
../../rawdata/hadoop/jira/issues/hdfs/HDFS_...
../../rawdata/hadoop/jira/issues/mapreduce/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/yarn/YARN_...
However I worry this may complicate the folder hierarchy too much due to its depth.
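To make the two options above concrete, here is a hedged Python sketch. It assumes file names start with the issue key followed by an underscore (the `HDFS_...` shape shown above); the rest of each file name is elided in this thread, so the names used in the example are hypothetical placeholders, and none of these helpers exist in Kaiaulu.

```python
from collections import defaultdict
from pathlib import Path


def issue_key_of(filename: str) -> str:
    """Return the issue key prefix, assuming names look like '<KEY>_<rest>'."""
    return filename.split("_", 1)[0]


def group_by_issue_key(filenames):
    """Option 1: keep one flat folder and discern files by their key prefix."""
    groups = defaultdict(list)
    for name in filenames:
        groups[issue_key_of(name)].append(name)
    return dict(groups)


def subfolder_for(issues_dir: Path, filename: str) -> Path:
    """Option 2: one lower-cased sub-folder per issue key."""
    return issues_dir / issue_key_of(filename).lower() / filename


if __name__ == "__main__":
    # Hypothetical file names; only the '<KEY>_' prefix comes from the thread.
    names = ["HDFS_a.json", "YARN_a.json", "HDFS_b.json", "MAPREDUCE_a.json"]
    print(group_by_issue_key(names))
    print(subfolder_for(Path("rawdata/hadoop/jira/issues"), names[0]).as_posix())
```

A refresher that looks for "the latest file in the folder" would need something like `group_by_issue_key` in the flat layout, whereas the sub-folder layout lets it keep treating each folder as belonging to a single key.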
The other anomaly case is the Spring Framework. You can read about it here: https://spring.io/blog/2019/01/15/spring-framework-s-migration-from-jira-to-github-issues
Here's Spring GitHub: https://github.com/spring-projects/spring-framework/commits/main/
Basically, Spring used to have JIRA, and moved on to managing issues on GitHub (e.g. https://github.com/spring-projects/spring-framework/issues/16906). Now, what would this migration look like using your downloaders and this new proposed file organization? This one is likely harmless. It would be:
../../rawdata/spring/jira/issues/SPR_...
../../rawdata/spring/github/issues/spring-projects_spring-framework_...
Please give some thought to the above in one of your internal meetings. This is why I created a separate issue, as it affects all of you. I'd also recommend that you (@Ssunoo2) edit your post with how GitHub saves, and that @ian-lastname make a post here on how the mailing list downloader saves. You want to have them all side by side to make sure the organization is consistent.
Mbox uses the helix.yml config. Going by how the Jira save file path is now done, I'll make the storage organization for mbox as follows:
../../rawdata/helix/mbox
There aren't separate kinds of mail to look out for, so there is no need to create separate folders.
@ian-lastname your mbox architecture will likely be a bit more complex than that. I want you to take a look at OpenSSL as a reference point: https://www.openssl.org/community/mailinglists.html
As you can see, OpenSSL (and in general any open source project) generally has multiple mailing lists: one for users, another for developers, and so on. In addition to that, a single mailing list may have multiple archives. See for example:
openssl-dev archives
https://marc.info/?l=openssl-dev
https://www.mail-archive.com/openssl-dev@openssl.org/
https://groups.google.com/groups?group=mailing.openssl.dev
It has 3 archives. Now you may wonder why someone would download data from 3 archives for the same mailing list. This is because the archives sometimes cover different periods of a mailing list's existence. E.g. Google Groups could cover 2009-2013, MARC 2008-2016, and the third archive some overlap of both.
Your folder organization has to accommodate this. I'd argue your situation is similar to the HADOOP case, which has multiple JIRA issue keys in a single project. So please discuss this with your group too, and afterwards edit your proposal to show what OpenSSL would look like as a folder organization.
../../rawdata/helix/mbox/openssl-dev/marc
../../rawdata/helix/mbox/openssl-dev/googlegroups
../../rawdata/helix/mbox/openssl-users/googlegroups
As we discussed on call, for projects that have multiple project keys (Anomaly Case 1), we will be using this format to organize the folder structure:
../../rawdata/hadoop/jira/issues/hdfs/HDFS_...
../../rawdata/hadoop/jira/issues/mapreduce/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/yarn/YARN_...
We are using this structure because we don't have to make edits to our existing functions that look for files.
For Anomaly Case 2, we decided that we do not need to worry about it because it should not affect the current structure.
For the Mbox folder structure, we will use this structure:
../../rawdata/helix/mbox/openssl-dev/marc
../../rawdata/helix/mbox/openssl-dev/googlegroups
../../rawdata/helix/mbox/openssl-users/googlegroups
This structure separates each list and further separates the archive in each list.
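As a rough illustration of how code could consume this layout, the Python sketch below walks an mbox root and returns every (mailing list, archive) pair it finds. The function name `list_archives` is hypothetical, and the sketch assumes exactly two folder levels under the mbox root, as in the structure above.

```python
import tempfile
from pathlib import Path


def list_archives(mbox_root: Path):
    """Return sorted (mailing_list, archive) pairs found under mbox_root.

    Assumes the layout <mbox_root>/<mailing_list>/<archive>/.
    """
    pairs = []
    for list_dir in sorted(p for p in mbox_root.iterdir() if p.is_dir()):
        for archive_dir in sorted(p for p in list_dir.iterdir() if p.is_dir()):
            pairs.append((list_dir.name, archive_dir.name))
    return pairs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "mbox"
        for sub in ("openssl-dev/marc",
                    "openssl-dev/googlegroups",
                    "openssl-users/googlegroups"):
            (root / sub).mkdir(parents=True)
        for mailing_list, archive in list_archives(root):
            print(mailing_list, archive)
```

Because each archive has its own folder, a refresher can track the latest downloaded month per archive without the overlapping periods of different archives interfering with each other.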
Here is my suggested change in the config file format.
# example for cases that have multiple project keys
issue_tracker:
  jira:
    # Obtained from the project's JIRA URL
    domain: https://issues.apache.org/jira
    project_key:
      - hdfs
      - mapreduce
      - yarn
    # Download using `download_jira_data.Rmd`
    issues:
      - ../../rawdata/hadoop/jira/issues/hdfs
      - ../../rawdata/hadoop/jira/issues/mapreduce
      - ../../rawdata/hadoop/jira/issues/yarn
    issue_comments:
      - ../../rawdata/hadoop/jira/issues_comments/hdfs
      - ../../rawdata/hadoop/jira/issues_comments/mapreduce
      - ../../rawdata/hadoop/jira/issues_comments/yarn
mailing_list:
  # Where is the mbox located locally?
  mbox:
    - ../../rawdata/helix/mbox/openssl-dev/marc
    - ../../rawdata/helix/mbox/openssl-dev/googlegroups
    - ../../rawdata/helix/mbox/openssl-users/googlegroups
@anthonyjlau @ian-lastname
Concerning the mbox, there is more than just the paths that need to be changed. This is the full extent of the mailing list information:
https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/conf/apr.yml#L47-L54
Contrast to openssl:
Minimally, you may have an mbox file that you acquired from another project. But alternatively you may need to use one of Kaiaulu's downloaders to get the data. Check what the Kaiaulu functions need in order to execute (modifying them into a refresher is @ian-lastname's current task), and try to update the specification above before proceeding.
In the off-chance the project migrated the domain of their JIRA issue tracker, your config file proposal will break, since it assumes one domain for all the issue keys. Another concern I have is that if you mimic the enumeration you have done on project_key, issues, and issue_comments, there is an implicit assumption of order across them. Could you propose a different template here where, under `jira`, the user specifies (domain, project_key, issues, issue_comments) per issue key? This would make the information of each group more explicit. Do this in a separate comment, so we can consider the pros and cons side by side of what you have vs what the other would look like.
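The ordering concern can be illustrated with a short Python sketch. It works on plain lists and dicts (no YAML parser is assumed), and the function name `to_per_key` and the `project_key_<n>` labels are hypothetical. The parallel lists only stay correct as long as position i in every list refers to the same issue key, whereas the per-key form keeps each group's fields together.

```python
def to_per_key(domain, project_keys, issues, issue_comments):
    """Convert order-dependent parallel lists into one explicit group per key."""
    if not (len(project_keys) == len(issues) == len(issue_comments)):
        # With parallel lists, a missing entry silently shifts every pairing.
        raise ValueError("project_key, issues and issue_comments differ in length")
    return {
        f"project_key_{i}": {
            "domain": domain,  # may now differ per key if a tracker migrates
            "project_key": key,
            "issues": issue_path,
            "issue_comments": comment_path,
        }
        for i, (key, issue_path, comment_path) in enumerate(
            zip(project_keys, issues, issue_comments), start=1
        )
    }


if __name__ == "__main__":
    per_key = to_per_key(
        "https://issues.apache.org/jira",
        ["hdfs", "mapreduce", "yarn"],
        [f"../../rawdata/hadoop/jira/issues/{k}" for k in ("hdfs", "mapreduce", "yarn")],
        [f"../../rawdata/hadoop/jira/issues_comments/{k}" for k in ("hdfs", "mapreduce", "yarn")],
    )
    for name, group in per_key.items():
        print(name, group["project_key"], group["issues"])
```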
@Ssunoo2 You will face the same consideration for your GitHub config file: https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/conf/openssl.yml#L65-L70
The anomaly case you are most likely to experience on GitHub would be project issues scattered across different GitHub projects. I have not encountered that yet, but I would not be surprised if such projects existed. Regardless, the solution would mimic what is decided for the JIRA config file.
Here is the updated version of the jira data storage:
issue_tracker:
  # each field in Jira will be a project key
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/hdfs
      project_key: HDFS
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/hdfs
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/hdfs
    project_key_2:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/mapreduce
      project_key: MAPREDUCE
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/mapreduce
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/mapreduce
    project_key_3:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/yarn
      project_key: YARN
      # local folder path
      issues: ../../rawdata/hadoop/jira/issues/yarn
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/yarn
Not sure if I followed correctly but the mailing list config part should look something like this:
Carlos Edit: I modified the config below.
mailing_list:
  mod_mbox:
    mail_key_1:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/
@anthonyjlau @ian-lastname
I modified the config above so that it tries to stay consistent with the folder depth of the other downloaders and accounts for the information the functions need. I also changed `project_key_1` to `mail_key_1`, since they are all from the same project and it is just the mailing list that serves a different purpose.
@ian-lastname try to work with this and post here if for some reason it doesn't work with the functions you are using to refresh.
mailing_list:
  mod_mbox:
    domain: http://mail-archives.apache.org/mod_mbox/geronimo-user
    mail_key_1:
      key: geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      key: geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/
So, I modified the mod_mbox config. I changed it to this because the downloader function was already made to put together the full URL for the download using a base domain (domain) and a mailing list (key). Plus, with this change, I can easily obtain the name of the mailing list so that I can put it into the file name of the downloaded mbox file.
Also, I don't think there is a notebook on the pipermail download function. Correct me if I'm wrong please.
@ian-lastname "because the function already does it" is not a good rationale: I modified the config so both pipermail and mod_mbox are consistent in the way the user provides the information. It is also clearer for someone to see a URL that they can paste into the browser than to figure out what a `key` is. Your config also seems to be duplicating the key in the domain URL.
The other point of concern is domain. I am not sure whether there will be a case where a project's mailing lists end up in two domains for mod_mbox. So it is better to keep it flexible per mail key, so we do not have to modify it in the future.
Unless you made any other change, stick to https://github.com/sailuh/kaiaulu/issues/286#issuecomment-2040898175.
You can modify it to be a URL in this line:
https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/R/mail.R#L148
Just replace the `base_url`, `mailinglist` arguments with a `url` parameter that you take as input to the function.
> Also, I don't think there is a notebook on the pipermail download function. Correct me if I'm wrong please.
Seems not. Please add it to:
https://github.com/sailuh/kaiaulu/blob/master/vignettes/download_mod_mbox.Rmd
When you are done with the changes!
As far as the key is concerned: before you worry about that in `mod_mbox`, try to find an example on pipermail and run the function.
https://mail.python.org/pipermail/mailman-users/
I believe Python can be used as an example. In fact, that's where the pipermail code originated in 2021:
https://mail.python.org/pipermail/mailman-users/2012-October/074208.html
Let me know how running this goes. Note you will need to modify the pipermail function to also allow control of the `from_year` and `to_year` parameters. Make sure to find a few other pipermail mailing lists you can try the function out on.
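As a rough Python sketch of what `from_year` and `to_year` control could look like: the function below enumerates the monthly archive URLs for a pipermail list between two years. It assumes the usual Mailman 2 pipermail naming of `<YYYY>-<Month>.txt.gz` under the list URL, which should be checked against each archive before relying on it. The function name is hypothetical and this is not Kaiaulu's R implementation.

```python
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def pipermail_month_urls(archive_url, from_year, to_year, suffix=".txt.gz"):
    """List one candidate URL per month from from_year to to_year (inclusive).

    Assumes files are named '<YYYY>-<Month><suffix>'; months that the
    archive does not actually have must be skipped by the caller.
    """
    if from_year > to_year:
        raise ValueError("from_year must not be later than to_year")
    base = archive_url.rstrip("/")
    return [
        f"{base}/{year}-{month}{suffix}"
        for year in range(from_year, to_year + 1)
        for month in MONTHS
    ]


if __name__ == "__main__":
    urls = pipermail_month_urls(
        "https://mail.python.org/pipermail/mailman-users/", 2012, 2012
    )
    print(urls[9])
```

A refresher could call this with `from_year` set to the year of the newest file already on disk, so that only missing or incomplete months are fetched again.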
See https://github.com/sailuh/kaiaulu/discussions/92 for context.
Here is the format for the jira and github config files:
issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://github.com/sailuh/kaiaulu
      project_key: KAIAULU
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/geronimo/jira/issues/
      issue_comments: ../../rawdata/geronimo/jira/issue_comments/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/
      issue: ../../rawdata/kaiaulu/github/issue/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/
      commit: ../../rawdata/kaiaulu/github/commit/
Please feel free to comment on anything that is formatted incorrectly.
Please feel free to comment on anything that is formatted incorrectly
@Ssunoo2
Just post a new comment below with the corrected version instead of editing your existing one so it is not confusing to follow-up later:
The domain information for Kaiaulu's JIRA is wrong:
issue_tracker:
  jira:
    # Obtained from the project's JIRA URL
    domain: https://sailuh.atlassian.net
    project_key: SAILUH
This should be it instead. Try your downloader against it to see if it works. Note that Kaiaulu's domain is different from the other JIRAs, which use Apache's.
Also, did you modify the existing end points in GitHub (commit, pr, etc) so they are folders and can refresh? I don't remember.
Could you add another project to github for Kaiaulu, including your fork information, to see what it looks like?
Also I think the endpoints on your config do not agree with what Anthony put here: https://github.com/sailuh/kaiaulu/issues/286#issuecomment-2040836331
There should be another folder at the end of the endpoints. For JIRA, it is named after the JIRA project key. For GitHub, the equivalent is the owner_repo combination. So in the Kaiaulu config you would have:
issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu
for the main repo,
but if I was also downloading and tracking a fork, then that would be:
issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu
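The owner_repo suffix rule can be sketched in Python as follows. The helper name `github_endpoint_path` and its default `rawdata` argument are hypothetical; the sketch only shows how the main repository and a fork of the same project end up in sibling folders under each endpoint.

```python
from pathlib import PurePosixPath


def github_endpoint_path(project, endpoint, owner, repo, rawdata="../../rawdata"):
    """Build <rawdata>/<project>/github/<endpoint>/<owner>_<repo>."""
    return str(
        PurePosixPath(rawdata) / project / "github" / endpoint / f"{owner}_{repo}"
    )


if __name__ == "__main__":
    # Main repository and a fork, both filed under the same project folder.
    for owner in ("sailuh", "ssunoo2"):
        print(github_endpoint_path("kaiaulu", "issue_or_pr_comment", owner, "kaiaulu"))
```

The JIRA equivalent would append the project key instead of the owner_repo pair.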
You can include your fork as an example of project_key_2 here so we can discuss, but don't include it in your actual commit, since we do not need to download anything from there. So that we have a realistic example, please create a codeface.conf
and edit it so that project_key_1 is https://github.com/siemens/codeface
and project_key_2 is Nicole's fork: https://github.com/lfd/codeface/tree/nicole-updates
Note that in the Codeface config file, under the branch region,
you will include an additional line below master called - nicole_updates
Is this looking right?
issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh
#   project_key_2:
#     # Obtained from the project's JIRA URL
#     domain: https://sailuh.atlassian.net
#     project_key: ssunoo2
#     # Download using `download_jira_data.Rmd`
#     issues: ../../rawdata/kaiaulu/jira/issues/ssunoo2
#     issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/ssunoo2
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
#   project_key_2:
#     # Obtained from the project's GitHub URL
#     owner: sailuh
#     repo: kaiaulu
#     # Download using `download_github_comments.Rmd`
#     issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
#     issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
#     pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
#     commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
For JIRA, I appended project_key to the end of the file path. For GitHub, I appended owner_repo to the end of the file path. I'll work on testing and make the codeface config file. Regarding the refresh for the pull requests and commits, I had originally thought I was supposed to do it, but during week 11 you corrected me and specified issues and comments only.
No. There is no `ssunoo2` project key in Kaiaulu JIRA. We should not include fictitious examples in the config file, even if commented out. It will confuse users. Remove project_key_2 from the jira portion.
The project_key_2 on GitHub is also wrong: the fork is not owned by sailuh; rather, the owner is ssunoo2 and the repo is kaiaulu. I am a bit worried the config file may not be making any sense to you at this point. Should we go over this briefly on a call, if it helps?
Here is the updated config format for the issue_trackers:
issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh/
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/sailuh_kaiaulu/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
#   project_key_2:
#     # Obtained from the project's GitHub URL
#     owner: ssunoo2
#     repo: kaiaulu
#     # Download using `download_github_comments.Rmd`
#     issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu/
#     issue: ../../rawdata/kaiaulu/github/issue/ssunoo2_kaiaulu/
#     refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/ssunoo2_kaiaulu/
#     pull_request: ../../rawdata/kaiaulu/github/pull_request/ssunoo2_kaiaulu/
#     commit: ../../rawdata/kaiaulu/github/commit/ssunoo2_kaiaulu/
Note that a new folder 'refresh_issues' is created as a result of #282.
The issues #275 #282 #284 #285 are affected by this issue.
@Ssunoo2 @ian-lastname @anthonyjlau to centralize discussion, please use this issue to reach consensus on how you plan to handle the storage organization, file names, etc. of your own refreshers + the JIRA refresher. Once we are clear on this here, you can move the final decision to the first comment of your respective issues.