daomcgill commented 1 month ago

Purpose

Parallelize the refresh pipeline to efficiently handle the download and analysis of large projects concurrently.

Part 1 (Proof of Concept)

Process

For a proof of concept (POC), parallelize the parsing capabilities of .mbox files. Each job will handle the processing of one month's data, ensuring that tasks remain independent and manageable. Python will manage the job queue and dispatcher, assigning jobs to threads as resources become available (scheduling policy TBD).

Workflow

User uses CLI to initiate parsing of .mbox files via /exec R scripts.
/exec script parses the argument, and invokes the Python script.
Python manages the job queue, assigning jobs to threads.
Each thread calls /exec/parse_mbox.R to handle the parsing of its assigned file.
Scripts perform their tasks and save output.

Task List

[x] Define project structure. Find a home for Python scripts (cannot live in R/).
[ ] Format a configuration file to use with parse managers.
[ ] A new R/exec script to parse a single mbox file into a table calling the parse_mbox() function in it. R/exec scripts should be taking config files as input, and not its own set of parameters.
[x] A python script that can call said R script in parallel
[ ] Update documentation.

Libraries

httr
stringi
yaml
ThreadPoolExecutor

Part 2

Implement parallel processing for all downloaders. This section will be addressed after the successful implementation of Part 1.

References

Issue #248: Scaling Analysis with BatchJobPool Issue #231: Parallel Git Log Entity Analysis PR #234: Adds Parallelization Support for Git Log Entities

daomcgill commented 1 month ago

@carlosparadis Not sure if I am on the right track here, but this is my understanding of the issue so far. I am assuming that Python should be used for this, or should I use a package like parallel in R? I am also wondering whether the configuration should be included in the spark.yml, or if a new file should be used for it. Also, for the PR, should I make a branch based on 284-mbox-download-refresher rather than main?

carlosparadis commented 1 month ago

@daomcgill This issue has a bit more open-ended research to figure out, which is why I left for M2. The scope of the issue is more ambitious, however: You should do this for all the downloaders. This does not mean you will need to write their downloaders, but maybe their execs. However, mbox is a good starting point so let's work this out before looking at the others so we have an working proof of concept.

We can go either the R route or the Python route. The first thing I would like you to assess is if either allows us to parallelize on OS X. I think the R parallel route only works on Linux. I feel we will still want to have the Python route.

The second thing you need to give more thought is how we can parallelize this. Up until recently Kaiaulu did not have downloaders that would download chunks of data per file, let alone refresh them. With just one giant file to look at for downloaded data, this meant we could not just process each file concurrently. We are taking some inspiration from how Codeface does this on the Python end (as described here): https://github.com/sailuh/kaiaulu/issues/248#issue-1989833115

Here are more details:

Codeface Temporal Configs

For starters, consider Codeface also has project config files, but their specification differs from Kaiaulu. Codeface has a similar purpose as Kaiaulu to analyze projects. Codeface interface is a CLI, whereas Kaiaulu is an API. Regardless, both tools can be used to construct time-series analysis. Codeface has a fixed pipeline on how it analyze things once you enter one of its CLI commands:

https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/conf/busybox.conf#L14-L15

Note Codeface will give privilege to the user to specify a time window analysis of a particular project. If you do not specify one, I believe it defaults to every 3 months. Because Codeface will do a temporal analysis (which is not an assumption held in Kaiaulu), then it can identify opportunities to parallelize. For example, it will chunk the git log into pieces and add it to a pool:

https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/project.py#L107-L113

Temporal Mbox Parallel Processing

Conceptually, you can imagine this to be the equivalent of what you have on one month worth of data per .mbox file. Your refresher gives you the chunks of data saved to disk. This means while codeface slices the data on that function code block, yours is already sliced after you download it, and the refresh functions will ensure it remains that way. You could technically load the files to slice them differently, but for now the simple case is just to process each file separately in parallel.

To do that, I would like you to use the exec/ scripts. The idea thus is then, you have 1 month worth of data, and you have an R script that can be fired taking as input any given 1 month of data. The script then turns that file into a new output file. This exec script could simply be for starters one that calls parse_mbox(). That will invoke Perceval to tabulate the .mbox file in parallel, generating another folder worth of .csvs. It is a simple script, but can serve as proof of concept for more complex processing later on. This would be a form of temporal parallelization for one type of data.

Your Python code then, similar to Codeface could add each call to the script to a pool to fire it. The Python script can't live in R/. I believe in R packages, which Kaiaulu is and Codeface is not, it goes under a folder we don't have yet. This too needs to be checked.

Cross-Dataset Parallel Processing

Note Codeface will also do other stages, but the between stage I believe is sequential: https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/project.py#L116-L127

This is a reminder that nothing would stop us to just start firing exec scripts of other downloaders also in parallel to process them. This would be an intra-parallelization across data sources we could capitalize on.

Git Parallel Processing

At some point, I would like us to look into some analysis in Kaiaulu that could benefit from this. Nicole did one here: https://github.com/sailuh/kaiaulu/issues/231 however note I am prioritizing your efforts on looking at Python parallel libraries, rather than R. And because Kaiaulu is written in R, whereas that portion of the analysis is written in Python in Codeface, we can't parallelize the process without exposing Kaiaulu functionality via exec scripts.

This, I believe however, is fine. Code architecture-wise, the /exec folder is intended to host scripts that can be executed server-side in a cron job (meaning that the server will keep re-running the data downloader). This is why your refresh function should maintain a consistent state in the folder.

The place where you would want to run parallel processing is also server-side, in which case you are at /exec realm so to speak.

Please update your specifications accordingly with your newfound understanding, and feel free to ask more questions. This will need more joint thinking to identify opportunities here.

@nicolehoess feel free to chime in.

daomcgill commented 1 month ago

@carlosparadis I added an update. I broke down the task into Part 1 (PoC) and Part 2 (everything else). There are definitely some questions that still need answering, such as scheduling policies for both the I/O and CPU bound tasks, how the user will monitor jobs, retry policy, etc. I am thinking it might be better to use Thread Pooling for download and Process Pooling for tasks such as parsing. This is why I proposed having separate managers for the tasks, although the same manager could handle like-type tasks in the future. Having the YYYYMM format should help to simplify the identification and tracking of jobs. For this issue I think I might get a clearer idea if I start trying things out.

carlosparadis commented 1 month ago

@daomcgill We will not try to parallelize download. Making too many requests to a server will get you ip blocked. The only parallel processing will happen offline to turn the existing files in a folder into something useful. In this poc, we will run in parallel the task of going from raw data to table using parse_mbox(). This is the first step. Don't worry about cron, monitoring jobs or retry policy for now.

What I would like to see is just a python script that, giving a folder path to mbox, fires in parallel the R/exec script that turns them into a table.csv file.

To make it clear: We are not trying to replicate what Codeface does with one entire CLI that does everything. Kaiaulu architecture is different.

I therefore expect you would need:

A new R/exec script to parse a single mbox file into a table calling the parse_mbox() function in it. R/exec scripts should be taking config files as input, and not its own set of parameters.
A python script that can call said R script in parallel

Once we can run this, we can worry about other steps if at all needed.

daomcgill commented 1 month ago

@carlosparadis I’ve added a commit containing a first pass at this issue. While it’s still a work in progress, the basic functionality is in place, and it’s successfully performing the task at a fundamental level.

Summary of the PR:

Added the Python file to /inst/python following R package standards.
Each thread in parse_mbox.py now successfully calls exec/parsembox.R for one month's data.
Included an initial draft of the vignette.

Next Steps/To Do:

"R/exec scripts should be taking config files as input, and not its own set of parameters." (In hindsight: probably would have been easier to do this first)
Create a configuration YAML file to store parameters (e.g., paths, number of threads).
Replace hardcoded paths in the Python script with configurable parameters.
Enable user control over the num_threads parameter and establish a reasonable default.
Improve documentation and add more detailed comments throughout the code for better readability.

I’m open to feedback and suggestions for improvement.

carlosparadis commented 1 month ago

@daomcgill Hi Dao,

To be clear, the existing project configuration files should be passed to Kaiaulu R/exec (as they would even without the multi-thread. However, for sanity sake, the Python script can take arguments for the multi-thread portion.

The main reason of the project configuration files is to document parameters that aid in reproducibility. The number of threads used does not matter for reproducibility, so it is OK it is arguments for the Python script.

Once you are more finished with this example, we can expand to more complicated processing than just turn json into tables, which may include month files beyond just mbox.

Thanks!

sailuh / kaiaulu

Parallelization of the Refresh Pipeline #310

Purpose

Part 1 (Proof of Concept)

Process

Workflow

Task List

Libraries

Part 2

References

Codeface Temporal Configs

Temporal Mbox Parallel Processing

Cross-Dataset Parallel Processing

Git Parallel Processing

Summary of the PR:

Next Steps/To Do: