sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Refresher Capability for MBOX Downloader (Milestone 2) #284

Open ian-lastname opened 6 months ago

ian-lastname commented 6 months ago

1. Purpose

The purpose of this issue is to add refresh capability to the mod_mbox downloader and the pipermail downloader. I will create a refresh function for each downloader, as well as a parser function that parses the most recently downloaded mail file. There are two mod_mbox downloader functions: download_mod_mbox and download_mod_mbox_per_month. Since pagination is required for a refresh function, I will focus only on download_mod_mbox_per_month.

2. Process

I will base my changes and new code on the existing mbox downloader and parser code. For the refresh capability, I will look through Sean's JIRA downloader refresher to get a good idea of how to build it. From what I already know about it, I will most likely be writing a new function that takes a date of some sort.

3. Endpoints

From the meeting, it seems I only have year and month to work with when it comes to endpoints. I'll check around a bit more to make sure.

4. Task List

Refresher (Endpoint)

I'll be using the year as the endpoint. For the refresher function, the upper-bound endpoint will be the current year, obtained from a built-in function that returns the current year.

Refresher Function: refresh_mod_mbox(archive_url, mailing_list, archive_type, from_year, save_folder_path, verbose=FALSE)

Refresher Function for pipermail: refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path, verbose=FALSE)

New Parser: parse_mbox_latest_date(mbox_path)
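To illustrate what a "latest date" parser needs to do, here is a minimal sketch in Python (not Kaiaulu's R code). The function name latest_message_date is hypothetical, and it assumes the refresher only needs the most recent Date header in the file; the actual parse_mbox_latest_date may work differently.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def latest_message_date(mbox_path: str) -> Optional[datetime]:
    """Return the most recent Date header found in an mbox file, or None.

    Hypothetical helper for illustration only.
    """
    latest = None
    # create=False: fail instead of silently creating a missing mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            raw_date = message.get("Date")
            if raw_date is None:
                continue
            try:
                parsed = parsedate_to_datetime(str(raw_date))
            except (TypeError, ValueError):
                continue  # skip malformed Date headers
            # Assumption: treat dates without a timezone as UTC so that
            # all values can be compared with each other.
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=timezone.utc)
            if latest is None or parsed > latest:
                latest = parsed
    finally:
        box.close()
    return latest
```

The year and month of the returned date would then give the refresher its starting point.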

Incorporating Month as an Endpoint Along With Year

Currently, the endpoint parameters of the downloader/refresher functions accept only a year (e.g. 2004). Because of this, a download always starts at the beginning of the given "from" year. The downloader can be made to start at a specified month as well as a year. The logic for doing so is as follows:

Pipermail: Manually Prompting Pipermail Refresher to Start After a Certain Year and Month

Pipermail archives store their archived mail in txt or txt.gz format. Here is an example of a pipermail archive:

[image: piper1]

In this picture, you can see that the downloadable version of each mail file is available through a link to the txt file. Clicking on the link takes you to this page:

[image: piper2]

This is a raw file of all the mail messages in April 2018. Notice the naming convention of the downloadable file, which is underlined in red: the file is named on a year-month basis. Download the file for the date you want to start from, and put it in the save folder on which you will run the pipermail refresh.

Next, rename your downloaded file to the correct naming format (e.g. openssl_mta_201804.mbox, as per the second picture). The refresher should then start from the month and year of your downloaded file.

You might not even need the full naming format: as long as the name has the yearmonth part and the correct extension (e.g. 201804.mbox should be enough to start from April 2018), it should work. You might not need to manually download the file from the mail archive at all: a blank file with the correct naming convention (or at the very least yearmonth.mbox) should be sufficient, as the refresher will delete that file and replace it with the actual mail file for that year and month.
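The starting-point logic described above can be sketched as follows, in Python for illustration only (not Kaiaulu's R code). Both function names are hypothetical, and the sketch assumes the refresher identifies the latest file purely from a trailing yearmonth.mbox in the file name.

```python
import re
from typing import Iterable, List, Optional, Tuple

YearMonth = Tuple[int, int]

# Matches names such as "openssl_mta_201804.mbox" or just "201804.mbox".
_YEARMONTH_MBOX = re.compile(r"(\d{4})(\d{2})\.mbox$")


def latest_year_month(filenames: Iterable[str]) -> Optional[YearMonth]:
    """Return the latest (year, month) among the mbox file names, or None."""
    found = []
    for name in filenames:
        match = _YEARMONTH_MBOX.search(name)
        if match:
            year, month = int(match.group(1)), int(match.group(2))
            if 1 <= month <= 12:
                found.append((year, month))
    return max(found) if found else None


def months_to_refresh(start: YearMonth, end: YearMonth) -> List[YearMonth]:
    """List every (year, month) from start to end, both inclusive.

    The start month is included because the latest file is re-downloaded.
    """
    months = []
    year, month = start
    while (year, month) <= end:
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months
```

For example, a save folder containing only 201804.mbox and an end point of June 2018 would yield April, May and June 2018 as the months to download.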

ian-lastname commented 5 months ago

https://lists.apache.org/list.html?dev@apr.apache.org

example

ian-lastname commented 4 months ago
carlosparadis commented 4 months ago

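@ian-lastname Please add here the notes requested during the last meeting Friday:

  • Screenshots / urls / examples of how the pipermail .txt file can be obtained to manually prompt your refresher to start after a given year and month

There was another item, what was it?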

ian-lastname commented 4 months ago

> @ian-lastname Please add here the notes requested during the last meeting Friday:
>
>   • Screenshots / urls / examples of how the pipermail .txt file can be obtained to manually prompt your refresher to start after a given year and month
>
> There was another item, what was it?

I remember the other item: it was to link to the part of the pipermail refresher code that is supposed to print a warning message when a "no file found" error occurs at a given URL. It turns out I had removed the code that printed the warning message when the error is encountered.

carlosparadis commented 4 months ago

@ian-lastname If the code already exists, could you make a commit to just put it back? I have not started reviewing your code yet.

carlosparadis commented 4 months ago

The pipermail mbox refresher has a main IF and ELSE. When the IF branch is taken, the entire code logic defaults to download_pipermail.

download_pipermail downloads the main page of the mailing list archive (e.g. https://mta.openssl/pipermail/openssl-users/). This page contains the list of all URLs of the mbox files, as either .txt or .gz. Both are mbox in disguise; we only need to rename the file extensions.

download_pipermail gets the URLs, downloads the appropriate files and renames them. It relies on this main page file to know whether .gz or .txt will be available, and for which dates. Without said file, it is impossible to know which will be the case.

The ELSE portion of the pipermail refresher does not rely on the file. It therefore does not know the year to end at, other than from system time, and does not know whether .txt, .gz or both are available. In addition, the code logic for the current year and the last year was split into two functions. Combined with the .txt and .gz functions, this results in 4 functions being fired for every year/month up to the current year/month from system time. This saves a number of empty files, which are subsequently deleted as the download proceeds to the current year.

The rework of the ELSE branch should rely on the download_pipermail function: re-obtain the list of all files, take the year_month of the last downloaded file, and then download only the .txt or .gz files according to the URLs extracted from said file. This reduces the number of function calls to 1 per year/month, and also prevents firing for years and months that are not available (perhaps because the archive stopped storing data well before the current date).
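A rough sketch of the selection step in this rework, written in Python for illustration only (not Kaiaulu's R code). The function name links_to_download is hypothetical, and the sketch assumes the links on the archive main page end in names like 2018-April.txt or 2018-April.txt.gz; that link format is an assumption and should be checked against the actual page.

```python
import re
from typing import Dict, Iterable, List, Tuple

YearMonth = Tuple[int, int]

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Assumed link format: "<year>-<MonthName>.txt" with an optional ".gz".
_ARCHIVE_LINK = re.compile(r"(\d{4})-([A-Za-z]+)\.txt(\.gz)?$")


def links_to_download(links: Iterable[str],
                      last_downloaded: YearMonth) -> List[str]:
    """Pick one archive link per month, starting at last_downloaded.

    Only months that actually appear on the main page are returned, so
    nothing is requested for months the archive does not have.
    """
    chosen: Dict[YearMonth, str] = {}
    for link in links:
        match = _ARCHIVE_LINK.search(link)
        if not match:
            continue
        month = MONTHS.get(match.group(2))
        if month is None:
            continue
        key = (int(match.group(1)), month)
        if key < last_downloaded:
            continue
        # Keep the first link seen for a month: one download per month,
        # whichever of .txt or .gz the page lists first.
        chosen.setdefault(key, link)
    return [chosen[key] for key in sorted(chosen)]
```

Because the list comes from the page itself, the refresher would no longer need system time to decide where to stop, and would not need to try both extensions for each month.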

carlosparadis commented 4 months ago

download_mod_mbox was not tested on a project whose data does not extend to the current date, as most Apache projects had data up to the present. I suspect there will be a problem where empty files are saved (edit this comment later to refer to the issue Lihan or I posted about that).