sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Downloaders data storage organization #286

Open carlosparadis opened 3 months ago

carlosparadis commented 3 months ago

The issues #275 #282 #284 #285 are affected by this issue.

@Ssunoo2 @ian-lastname @anthonyjlau to centralize discussion, please use this issue to reach consensus on how you plan to handle the storage organization, file names, etc. of your own refreshers and the JIRA refresher. Once we are clear on this here, you can move the final discussion to the first comment of your respective issues.

Ssunoo2 commented 3 months ago

I'll start by just posting what the current storage organization is:

Jira downloader:

../../rawdata/issue_tracker/geronimo/issue_comments/
../../rawdata/issue_tracker/geronimo/issues/

Changed from

../../rawdata/issue_tracker/

I just make new directories for project name and issues or issue_comments respectively

Github Downloader:

Unchanged:

../../rawdata/github/kaiaulu/issue/
../../rawdata/github/kaiaulu/pull_request/
../../rawdata/github/kaiaulu/issue_or_pr_comment/
../../rawdata/github/kaiaulu/commit/
anthonyjlau commented 3 months ago

Bugzilla Downloader

Currently, the bugzilla_showcase notebook uses 3 different methods to download data: Traditional Perceval, Perceval's REST API, and Bugzilla's REST API. The one that I will be using for the refresher is Bugzilla's REST API.

Bugzilla's REST API Downloader Storage Organization

I will be using the current storage organization used for bugzilla issues as it is the same format as the GitHub version above.

../../rawdata/bugzilla/redhat/issues
../../rawdata/bugzilla/redhat/issues_comments
carlosparadis commented 3 months ago

Specification change

@Ssunoo2 there is something wrong with your filepath. I remember we agreed we should include jira in the path for consistency with bugzilla. In that sense, your issue_tracker folder should be called jira instead, since both of them are issue trackers.

In addition to that, and the primary reason why I wanted to create this issue to compare side by side, is that the project organization is counter-intuitive as it is on Kaiaulu (and I believe there was even some confusion in your group early on about why the files were organized in this manner).

We should organize the information at project level, i.e.:

Bugzilla

Instead of:

../../rawdata/bugzilla/redhat/issues
../../rawdata/bugzilla/redhat/issues_comments

We would have:

../../rawdata/redhat/bugzilla/issues
../../rawdata/redhat/bugzilla/issues_comments

Jira

And instead of:

../../rawdata/jira/geronimo/issue_comments/
../../rawdata/jira/geronimo/issues/

We would have:

../../rawdata/geronimo/jira/issue_comments/
../../rawdata/geronimo/jira/issues/

Motivation

The reason is that someone running a multi-project analysis generally thinks of the data "per project" rather than "per data source". In addition, if we are discussing a particular project and I would like to reproduce your analysis, I may need to ask you to send me "the data of the project". In the current organization, you would need to check every data source folder to fish for the data the project has, whereas in the new organization you simply zip the folder with said project name and send it over. Lastly, in the project folder organization, you can very quickly assess what data you have by opening the project folder. As it is, you also have to go check each data source folder.
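A minimal sketch of the project-first layout described above, written in Python purely for illustration (Kaiaulu itself is an R package). The helper names `raw_path` and `sources_for_project` are hypothetical and are not Kaiaulu functions; the point is only that one folder per project makes "what data do I have for this project?" a single directory listing.

```python
from pathlib import Path
import tempfile


def raw_path(root, project, source, endpoint):
    """Build <root>/<project>/<source>/<endpoint> (project-first layout)."""
    return Path(root) / project / source / endpoint


def sources_for_project(root, project):
    """List the data sources available for one project by opening its folder."""
    project_dir = Path(root) / project
    return sorted(p.name for p in project_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Build a throwaway tree so the example does not touch real data.
    with tempfile.TemporaryDirectory() as root:
        raw_path(root, "geronimo", "jira", "issues").mkdir(parents=True)
        raw_path(root, "geronimo", "jira", "issue_comments").mkdir(parents=True)
        raw_path(root, "redhat", "bugzilla", "issues").mkdir(parents=True)
        print(sources_for_project(root, "geronimo"))
```

Sharing "the data of the project" then amounts to archiving the single `<root>/<project>` folder.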

Anomaly Case 1

There are some strange cases out there, that I want to make sure you give proper consideration as you write the refresher of these downloaders. The first one is the HADOOP project. @Ssunoo2 this affects you the most since this is a JIRA project.

If you look at Hadoop on GitHub (https://github.com/apache/hadoop), particularly the commits, you will see it can have multiple JIRA IDs. You can imagine the mess it turned out to be trying to manage that in the current folder organization.

Let's assume the proposed new organization with the downloader logic you currently have for JIRA. I will focus on the issue folder since what works for issues would work for comments folder.

You would then have:

../../rawdata/hadoop/jira/issues/HDFS_...
../../rawdata/hadoop/jira/issues/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/YARN_...

All in one folder. Would your refresher function work in this case? Or would it break because it assumes all files in there are from a single issue key? If it will break, then we need to add some logic to discern files based on their issue key. That being the case, notice how much saner it is that this data is contained inside the hadoop folder.

We could technically make a sub-folder for every issue key, something like:

../../rawdata/hadoop/jira/issues/hdfs/HDFS_...
../../rawdata/hadoop/jira/issues/mapreduce/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/yarn/YARN_...

However I worry this may complicate the folder hierarchy too much due to its depth.
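To make the two options concrete, here is a hedged Python sketch (illustration only, not the R refresher). It assumes file names start with the issue key followed by an underscore, as in `HDFS_...` above; everything after the underscore in the usage example is a made-up placeholder, and the function names are hypothetical.

```python
from collections import defaultdict


def issue_key_of(file_name):
    """Return the text before the first underscore, e.g. 'HDFS' for 'HDFS_x'."""
    return file_name.split("_", 1)[0]


def group_by_issue_key(file_names):
    """Option A: keep one flat folder, but discern files by their issue key."""
    groups = defaultdict(list)
    for name in file_names:
        groups[issue_key_of(name)].append(name)
    return dict(groups)


def subfolder_for(file_name):
    """Option B: name of the per-key sub-folder a file would be stored in."""
    return issue_key_of(file_name).lower()
```

With a flat folder, a refresher would have to call something like `group_by_issue_key` before looking for the latest file of each key; with sub-folders, each folder already holds a single key, at the cost of one more directory level.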

Anomaly Case 2

The other anomaly case is the Spring Framework. You can read about it here: https://spring.io/blog/2019/01/15/spring-framework-s-migration-from-jira-to-github-issues

Here's Spring GitHub: https://github.com/spring-projects/spring-framework/commits/main/

Basically, Spring used to have JIRA, and moved on to managing issues on GitHub (e.g. https://github.com/spring-projects/spring-framework/issues/16906). Now, what would this migration look like using your downloaders and this new proposed file organization? This one is likely harmless. It would be:

../../rawdata/spring/jira/issues/SPR_...
../../rawdata/spring/github/issues/spring-projects_spring-framework_...

Please give some thought to the above in one of your internal meetings. This is why I created a separate issue, as it affects all of you. I'd also recommend that you (@Ssunoo2) edit your post with how GitHub saves, and that @ian-lastname make a post here on how the mailing list downloader saves. You want to have them all side by side to make sure the organization is consistent.

ian-lastname commented 3 months ago

Mbox uses the helix.yml config. Going by how the Jira save file path is now done, I'll make the storage organization for mbox as follows: ../../rawdata/helix/mbox. There aren't separate kinds of mail to look out for, so there is no need to create separate folders.

carlosparadis commented 3 months ago

@ian-lastname your mbox architecture will likely be a bit more complex than that. I want you to take a look at OpenSSL as a reference point: https://www.openssl.org/community/mailinglists.html

As you can see, OpenSSL (and in general any open source project) has multiple mailing lists: one for users, another for developers, and so on. In addition to that, a single mailing list may have multiple archives. See for example:

openssl-dev archives    

https://marc.info/?l=openssl-dev
https://www.mail-archive.com/openssl-dev@openssl.org/
https://groups.google.com/groups?group=mailing.openssl.dev

It has 3 archives. Now you may wonder why someone would download data from 3 archives for the same mailing list. This is because sometimes the archives cover different periods of a mailing list's existence. E.g. Google Groups could cover 2009-2013, MARC 2008-2016, and the third archive some overlap of both.

Your folder organization has to accommodate this. I'd argue your situation is a bit similar to the case of HADOOP, which has multiple JIRA issue keys in a single project. So please discuss this with your group too, and afterwards edit your proposal to show what OpenSSL would look like as a folder organization.

anthonyjlau commented 3 months ago
../../rawdata/helix/mbox/openssl-dev/marc
../../rawdata/helix/mbox/openssl-dev/googlegroups
../../rawdata/helix/mbox/openssl-users/googlegroups
anthonyjlau commented 3 months ago

As we discussed on call, for projects that have multiple project keys (Anomaly Case 1), we will be using this format to organize the folder structure:

../../rawdata/hadoop/jira/issues/hdfs/HDFS_...
../../rawdata/hadoop/jira/issues/mapreduce/MAPREDUCE_...
../../rawdata/hadoop/jira/issues/yarn/YARN_...

We are using this structure because we don't have to make edits to our existing functions that look for files.

For Anomaly Case 2, we decided that we do not need to worry about it because it should not affect the current structure.

For the Mbox folder structure, we will use this structure:

../../rawdata/helix/mbox/openssl-dev/marc
../../rawdata/helix/mbox/openssl-dev/googlegroups
../../rawdata/helix/mbox/openssl-users/googlegroups

This structure separates each list and further separates the archive in each list.
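As an illustration of the list/archive split above, a small Python sketch (not Kaiaulu code; the function name `mbox_folders` and the dictionary shape are assumptions made for this example) that expands a mapping of mailing lists to archives into one folder per pair:

```python
from pathlib import PurePosixPath


def mbox_folders(root, project, archives_by_list):
    """Return one folder per (mailing list, archive) pair."""
    base = PurePosixPath(root) / project / "mbox"
    return [
        str(base / mailing_list / archive)
        for mailing_list, archives in archives_by_list.items()
        for archive in archives
    ]


folders = mbox_folders(
    "../../rawdata",
    "helix",
    {
        "openssl-dev": ["marc", "googlegroups"],
        "openssl-users": ["googlegroups"],
    },
)
for folder in folders:
    print(folder)
```

Keeping the archive as the innermost folder means two archives of the same list never write into the same directory, even when the periods they cover overlap.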

anthonyjlau commented 3 months ago

Here is my suggested change in the config file format.

Multiple project keys

# example for cases that have multiple project keys
issue_tracker:
  jira:
    # Obtained from the project's JIRA URL
    domain: https://issues.apache.org/jira
    project_key:
    - hdfs
    - mapreduce
    - yarn

    # Download using `download_jira_data.Rmd`
    issues:
      - ../../rawdata/hadoop/jira/issues/hdfs
      - ../../rawdata/hadoop/jira/issues/mapreduce
      - ../../rawdata/hadoop/jira/issues/yarn
    issue_comments: 
      - ../../rawdata/hadoop/jira/issues_comments/hdfs
      - ../../rawdata/hadoop/jira/issues_comments/mapreduce
      - ../../rawdata/hadoop/jira/issues_comments/yarn

Mbox changes

mailing_list:
  # Where is the mbox located locally?
  mbox:
    - ../../rawdata/helix/mbox/openssl-dev/marc
    - ../../rawdata/helix/mbox/openssl-dev/googlegroups
    - ../../rawdata/helix/mbox/openssl-users/googlegroups
carlosparadis commented 3 months ago

@anthonyjlau @ian-lastname

Mail

Concerning the mbox, more than just the paths needs to change. This is the full extent of the mailing list information:

https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/conf/apr.yml#L47-L54

Contrast to openssl:

https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/conf/openssl.yml#L47-L55

Minimally, you may have an mbox file that you acquired from another project. But alternatively you may need to use one of Kaiaulu's downloaders to get the data. Check what the Kaiaulu functions need in order to execute (modifying them into a refresher is @ian-lastname's current task), and try to update the specification above before proceeding.

Issue Tracker

In the off-chance the project migrated the domain of their JIRA issue tracker, your config file proposal will break, since it assumes one domain for all the issue keys. Another concern I have is that if you mimic the enumeration you have done on project_key, issues, and issue_comments, there is an implicit assumption of order across them. Could you propose a different template where, under jira, the user specifies (domain, project_key, issues, issue_comments) per issue key? This would make the information of each group more explicit. Do this in a separate comment, so we can consider side by side the pros and cons of what you have vs. what the other would look like.
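The order assumption can be shown with a short Python sketch (illustrative only; the function name `to_per_key` and the group labels it generates are placeholders, not an agreed format). It converts the three parallel lists into one record per issue key, which is only correct if the lists line up position by position:

```python
def to_per_key(domain, project_keys, issues, issue_comments):
    """Turn parallel lists into one explicit record per issue key.

    The pairing relies purely on list position, which is the implicit
    assumption a per-key template would remove.
    """
    if not (len(project_keys) == len(issues) == len(issue_comments)):
        raise ValueError("parallel lists must have the same length")
    return {
        f"project_key_{i}": {
            "domain": domain,
            "project_key": key,
            "issues": issue_dir,
            "issue_comments": comment_dir,
        }
        for i, (key, issue_dir, comment_dir) in enumerate(
            zip(project_keys, issues, issue_comments), start=1
        )
    }
```

Once each record carries its own `domain`, a project whose keys live on different JIRA domains only needs that one field edited per record.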

carlosparadis commented 3 months ago

@Ssunoo2 You will face the same consideration for your GitHub config file: https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/conf/openssl.yml#L65-L70

The anomaly case you are most likely to experience on GitHub would be project issues scattered across different GitHub projects. I have not encountered that yet, but I would not be surprised if it existed. Regardless, the solution would mimic what is decided for the JIRA config file.

anthonyjlau commented 2 months ago

Here is the updated version of the jira data storage:

issue_tracker:

  # each field in Jira will be a project key
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/hdfs
      project_key: HDFS
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/hdfs
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/hdfs

    project_key_2:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/mapreduce
      project_key: MAPREDUCE
      # Download using download_jira_data.Rmd
      issues: ../../rawdata/hadoop/jira/issues/mapreduce
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/mapreduce

    project_key_3:
      # Obtained from the project's JIRA URL
      domain: https://issues.apache.org/jira/yarn
      project_key: YARN
      # local folder path
      issues: ../../rawdata/hadoop/jira/issues/yarn
      issue_comments: ../../rawdata/hadoop/jira/issues_comments/yarn
anthonyjlau commented 2 months ago

Not sure if I followed correctly but the mailing list config part should look something like this:

Carlos Edit: I modified the config below.

mailing_list:
  mod_mbox: 
    mail_key_1:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/
carlosparadis commented 2 months ago

@anthonyjlau @ian-lastname

I modified the config above so it stays consistent with the folder depth of the other downloaders and accounts for the information needed by the functions. I also changed project_key_1 to mail_key_1, since the entries all belong to the same project and differ only in the mailing list, each of which serves a different purpose.

@ian-lastname try to work with this and post here if for some reason it doesn't work with the functions you are using to refresh.

ian-lastname commented 2 months ago
mailing_list:
  mod_mbox:
    domain: http://mail-archives.apache.org/mod_mbox/geronimo-user
    mail_key_1:
      key: geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      key: geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/

So, I modified the mod_mbox config. The reason I changed it to this is that the downloader function was already made to put together the full url for the download using a base domain (domain) and a mailing list (key). Plus, with the way I changed it, I can easily obtain the name of the mailing list so that I can put it into the file name of the downloaded mbox file.

Also, I don't think there is a notebook on the pipermail download function. Correct me if I'm wrong please.

carlosparadis commented 2 months ago

@ian-lastname "because the function already does it" is not a good rationale: I modified the config so both pipermail and mod_mbox are consistent in the way the user provides the information. It is also clearer for someone to see a URL that they can paste into the browser than to figure out what a key is. Your config also seems to be duplicating the key in the domain url.

The other point of concern is domain. I am not sure whether there will be a case where a project's mailing lists end up on two domains for mod_mbox, so it is better to keep it flexible per mail key so we do not have to modify it in the future.

Unless you made any other change, stick to https://github.com/sailuh/kaiaulu/issues/286#issuecomment-2040898175.

You can modify to be a url in this line:

https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/R/mail.R#L148

Just replace the base_url and mailinglist parameters with a single url parameter that the function takes as input.
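For illustration, a Python sketch of how the mailing list name can still be recovered when the function only receives a full archive URL (the real function is R code in R/mail.R; the helper names and the file name pattern below are assumptions, not the existing implementation):

```python
from urllib.parse import urlparse


def mailing_list_name(archive_url):
    """Return the last path segment of the archive URL, e.g. 'geronimo-dev'."""
    path = urlparse(archive_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def mbox_file_name(archive_url, year, month):
    """Hypothetical file name pattern: <list>_<YYYY><MM>.mbox."""
    return f"{mailing_list_name(archive_url)}_{year:04d}{month:02d}.mbox"
```

This keeps the config to one user-readable URL per mailing list while still allowing the list name to appear in the saved file name.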

"Also, I don't think there is a notebook on the pipermail download function. Correct me if I'm wrong please."

Seems not. Please add it to:

https://github.com/sailuh/kaiaulu/blob/master/vignettes/download_mod_mbox.Rmd

When you are done with the changes!

carlosparadis commented 2 months ago

As far as the key is concerned: Before you worry about that in mod_mbox, try to find an example on pipermail and run the function.

https://mail.python.org/pipermail/mailman-users/

I believe Python can be used as an example. In fact, that's where the pipermail code originated in 2021:

https://mail.python.org/pipermail/mailman-users/2012-October/074208.html

Let me know how running this goes. Note you will need to modify the pipermail function so it also allows control of the from_year and to_year parameters. Make sure to find a few other pipermail mailing lists you can try the function on.

See https://github.com/sailuh/kaiaulu/discussions/92 for context.

Ssunoo2 commented 2 months ago

Here is the format for the jira and github config files:

issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://github.com/sailuh/kaiaulu
      project_key: KAIAULU
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/geronimo/jira/issues/
      issue_comments: ../../rawdata/geronimo/jira/issue_comments/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/
      issue: ../../rawdata/kaiaulu/github/issue/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/
      commit: ../../rawdata/kaiaulu/github/commit/

Please feel free to comment on anything that is formatted incorrectly

carlosparadis commented 2 months ago

@Ssunoo2

Just post a new comment below with the corrected version instead of editing your existing one so it is not confusing to follow-up later:

The domain information for Kaiaulu's JIRA is wrong:

issue_tracker:
  jira:
    # Obtained from the project's JIRA URL
    domain: https://sailuh.atlassian.net
    project_key: SAILUH

This should be it instead. Try your downloader against it to see if it works. Note that Kaiaulu's domain is different from the other JIRAs, which use apache.

Also, did you modify the existing end points in GitHub (commit, pr, etc) so they are folders and can refresh? I don't remember.

Could you add another project to github for Kaiaulu, including your fork information, to see what it looks like?

Also I think the endpoints on your config do not agree with what Anthony put here: https://github.com/sailuh/kaiaulu/issues/286#issuecomment-2040836331

There should be another folder at the end of the endpoints. For JIRA, it is named after the JIRA project key. For GitHub, the equivalent is the owner_repo combination. So in the Kaiaulu config you would have:

issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu

for the main repo, but if I was also downloading and tracking a fork, then that would be:

issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu

You can include your fork as an example of project_key_2 here so we can discuss, but don't include it in your actual commit since we do not need to download anything from there. So that we have a realistic example, please create a codeface.conf

And edit it so it includes on project_key_1: https://github.com/siemens/codeface

And on project_key_2 Nicole's fork: https://github.com/lfd/codeface/tree/nicole-updates

Note on the Codeface config file, under the branch region:

https://github.com/sailuh/kaiaulu/blob/2bc8d141c90c0eac635631f69531d4c406432940/conf/kaiaulu.yml#L43-L44

You will include an additional line below master called - nicole_updates

Ssunoo2 commented 2 months ago

Is this looking right?

issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh
    # project_key_2:
      # Obtained from the project's JIRA URL
      # domain: https://sailuh.atlassian.net
      # project_key: ssunoo2
      # Download using `download_jira_data.Rmd`
      # issues: ../../rawdata/kaiaulu/jira/issues/ssunoo2
      # issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/ssunoo2
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
    # project_key_2:
      # # Obtained from the project's GitHub URL
      # owner: sailuh
      # repo: kaiaulu
      # # Download using `download_github_comments.Rmd`
      # issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      # issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      # pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      # commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/

For JIRA, I appended project_key to the end of the file path. For GitHub, I appended owner_repo to the end of the file path. I'll work on testing and make the codeface config file. Regarding the refresh for the pull requests and commits, I had originally thought I was supposed to do it, but you corrected me during week 11 and specified issues and comments only.

carlosparadis commented 2 months ago

No. There is no ssunoo2 project key in Kaiaulu JIRA. We should not include fictitious examples even if commented on the config file. It will confuse users. Remove project_key_2 from the jira portion.

project_key_2 on GitHub is also wrong... the fork's owner and repo are not sailuh and kaiaulu; rather, the owner is ssunoo2 and the repo is kaiaulu. I am a bit worried the config file may not be making any sense to you at this point. Should we go over this briefly on call if it helps?

Ssunoo2 commented 2 months ago

Here is the updated config format for the issue_trackers:

issue_tracker:
  jira:
    project_key_1:
      # Obtained from the project's JIRA URL
      domain: https://sailuh.atlassian.net
      project_key: SAILUH
      # Download using `download_jira_data.Rmd`
      issues: ../../rawdata/kaiaulu/jira/issues/sailuh/
      issue_comments: ../../rawdata/kaiaulu/jira/issue_comments/sailuh/
  github:
    project_key_1:
      # Obtained from the project's GitHub URL
      owner: sailuh
      repo: kaiaulu
      # Download using `download_github_comments.Rmd`
      issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/sailuh_kaiaulu/
      issue: ../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/
      refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/sailuh_kaiaulu/
      pull_request: ../../rawdata/kaiaulu/github/pull_request/sailuh_kaiaulu/
      commit: ../../rawdata/kaiaulu/github/commit/sailuh_kaiaulu/
    # project_key_2:
      # # Obtained from the project's GitHub URL
      # owner: ssunoo2
      # repo: kaiaulu
      # # Download using `download_github_comments.Rmd`
      # issue_or_pr_comment: ../../rawdata/kaiaulu/github/issue_or_pr_comment/ssunoo2_kaiaulu/
      # issue: ../../rawdata/kaiaulu/github/issue/ssunoo2_kaiaulu/
      # refresh_issues: ../../rawdata/kaiaulu/github/refresh_issues/ssunoo2_kaiaulu/
      # pull_request: ../../rawdata/kaiaulu/github/pull_request/ssunoo2_kaiaulu/
      # commit: ../../rawdata/kaiaulu/github/commit/ssunoo2_kaiaulu/

Note that a new folder 'refresh_issues' is created as a result of #282
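
Since these paths are typed by hand, a small consistency check may help. The following Python sketch is an illustration only (the function `check_github_entry` is hypothetical and not part of Kaiaulu); it assumes the convention discussed in this thread, `../../rawdata/<project>/github/<endpoint>/<owner>_<repo>/`, and reports any endpoint path that deviates from it.

```python
def check_github_entry(project, entry):
    """Return a list of problems found in one github project_key entry.

    `entry` is the mapping under e.g. project_key_1, already loaded into a
    dict. Every field other than owner and repo is treated as an endpoint.
    """
    suffix = f"{entry['owner']}_{entry['repo']}"
    problems = []
    for field, value in entry.items():
        if field in ("owner", "repo"):
            continue
        expected = f"../../rawdata/{project}/github/{field}/{suffix}/"
        if value != expected:
            problems.append(f"{field}: expected {expected}, got {value}")
    return problems


example = {
    "owner": "sailuh",
    "repo": "kaiaulu",
    "issue": "../../rawdata/kaiaulu/github/issue/sailuh_kaiaulu/",
    # Deliberately malformed for the example: missing the rawdata segment.
    "commit": "../../kaiaulu/github/commit/sailuh_kaiaulu/",
}
for problem in check_github_entry("kaiaulu", example):
    print(problem)
```

A check like this would flag a missing `rawdata` segment or a wrong `owner_repo` suffix before any download is attempted.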