Downloader/Refresher Logic Overlap

carlosparadis commented 8 months ago

I'd like you to give some thought to how much duplicated code you will end up with in your refresher logic across Bugzilla, JIRA, GitHub, and Mbox.

Could you, for example, create a "refresher_function" that takes a few parameters and handles the Bugzilla, JIRA, and GitHub logic all in one? How much code are you duplicating by creating a bugzilla_refresh, jira_refresh, github_refresh, and mbox_refresh? I'd expect you would still need them, but part of their logic could possibly be reused.

Consider breaking the refresh into conceptual steps:

1. Read the list of files in a directory (check the HADOOP anomaly case on the issue to see whether it breaks this step without additional information)
2. Locate the filename with the latest timestamp
3. Return the filepath

I would imagine all your downloader refreshers could share one function capturing that logic, rather than each having its own copy. The unique behavior lies in a) accessing the file to obtain the most current timestamp, and b) how to update the folder thereafter.
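A minimal sketch of the three shared steps, written in Python for illustration only. The names `latest_file` and `timestamp_of` are hypothetical and not part of kaiaulu. The source-specific part (how a filename maps to a timestamp) is passed in as a callback, so the directory scan itself is written once.

```python
from pathlib import Path
from typing import Callable, Optional


def latest_file(directory: str,
                timestamp_of: Callable[[str], Optional[int]]) -> Optional[Path]:
    """Return the path of the file whose name carries the latest timestamp.

    `timestamp_of` (hypothetical) maps a filename to a comparable timestamp,
    or returns None when the name does not follow the expected convention.
    Files that return None are skipped.
    """
    best_path, best_stamp = None, None
    for path in Path(directory).iterdir():      # step 1: list the files
        if not path.is_file():
            continue
        stamp = timestamp_of(path.name)
        if stamp is None:
            continue
        if best_stamp is None or stamp > best_stamp:
            best_path, best_stamp = path, stamp  # step 2: keep the latest
    return best_path                             # step 3: return the filepath
```

Each refresher would then only supply its own `timestamp_of` and its own folder-update logic. An empty or non-conforming directory yields `None`, which a caller could treat as "no previous download".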

anthonyjlau commented 8 months ago

As we discussed on the call, it is possible to create an overarching refresher function that covers Bugzilla, JIRA, GitHub, and Mbox. However, the timestamp endpoints that each one uses are different: Mbox is specific to the month, JIRA to the minute, and GitHub and Bugzilla to the second. The different formats this requires make the refreshers difficult to combine. We therefore agreed not to create a general refresh function for this milestone. It may be addressed at a later time.
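To make the granularity difference concrete, here is a small Python sketch. The format strings are assumptions chosen only to show month, minute, and second precision; they are not taken from the kaiaulu downloaders, and `format_since` is a hypothetical name.

```python
from datetime import datetime, timezone

# Hypothetical per-source formats, chosen only to illustrate granularity.
TIMESTAMP_FORMATS = {
    "mbox": "%Y-%m",                    # specific to the month
    "jira": "%Y-%m-%d %H:%M",           # specific to the minute
    "github": "%Y-%m-%dT%H:%M:%SZ",     # specific to the second
    "bugzilla": "%Y-%m-%dT%H:%M:%SZ",   # specific to the second
}


def format_since(source: str, moment: datetime) -> str:
    """Render a timezone-aware `moment` in UTC at the source's granularity."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMATS[source])
```

A lookup table like this is one possible way a future general refresher could absorb the format differences, though it would not by itself resolve the coarser precision of Mbox.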

Ssunoo2 commented 8 months ago

Some of the overlap has already been factored into a function, notably parse_jira_latestdate(), which iterates through filenames and returns the one that contains the latest date. As long as the naming convention is Project...[Unix time of earliest issue]\[Unix time of most recent issue].json, it can be used across refreshers.
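For illustration, a Python sketch of that kind of filename scan. This is not the actual parse_jira_latestdate() implementation. It assumes the filename ends with two Unix timestamps separated by an underscore before `.json`; the separator and the name `filename_with_latest_date` are assumptions.

```python
import re
from typing import Iterable, Optional

# Assumed convention: "..._<earliest unix time>_<latest unix time>.json".
# The underscore separator is an assumption for this sketch.
_RANGE = re.compile(r"(\d+)_(\d+)\.json$")


def filename_with_latest_date(filenames: Iterable[str]) -> Optional[str]:
    """Return the filename whose 'most recent issue' timestamp is greatest."""
    best_name, best_latest = None, -1
    for name in filenames:
        match = _RANGE.search(name)
        if match is None:
            continue  # skip files that do not follow the convention
        latest = int(match.group(2))
        if latest > best_latest:
            best_name, best_latest = name, latest
    return best_name
```

Because it only looks at names, the same scan would work for any downloader that saves files under the same convention.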

Beyond this, extracting the value of the latest date often differs between downloaders, because the field that contains the latest date may differ from project to project. For instance, the JIRA downloader uses a field called 'created' and the GitHub downloader uses a field called 'created_at'. The nesting of these values may also differ.
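One possible way to handle differing field names and nesting is to pass the key path in as a parameter. The Python sketch below is illustrative only: `get_nested` and `latest_date` are hypothetical names, and it assumes the date strings are in a form `datetime.fromisoformat` accepts, which real API responses may not be.

```python
from datetime import datetime
from functools import reduce
from typing import Any, Iterable, Mapping, Optional, Sequence


def get_nested(record: Mapping[str, Any], key_path: Sequence[str]) -> Any:
    """Follow `key_path` into a nested record, one key at a time."""
    return reduce(lambda node, key: node[key], key_path, record)


def latest_date(records: Iterable[Mapping[str, Any]],
                key_path: Sequence[str]) -> Optional[datetime]:
    """Return the greatest date found at `key_path`, or None if no records."""
    dates = [datetime.fromisoformat(get_nested(r, key_path)) for r in records]
    return max(dates, default=None)
```

With this shape, a record that nests the date would use a path such as `("fields", "created")`, while one that keeps it at the top level would use `("created_at",)`. Both paths here are examples, not a statement of what either API returns.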

Ssunoo2 commented 7 months ago

The general idea from our 4/19 notes is that if two functions use the same API endpoint, you should make a separate function that calls the API and accepts an optional query parameter. The refresher and the download-by-date functions are then essentially wrapper functions that construct the query and call that function.

# Wrapper function
download_by_date(..., lowerbound, upperbound) {
    # Construct the query
    if lowerbound and not upperbound
        construct query > lowerbound
    if upperbound and not lowerbound
        construct query < upperbound
    if upperbound and lowerbound
        construct query > lowerbound & < upperbound
    # Call the API function
    bugzilla_api_call_function(..., query, ...)
}

# Wrapper function
refresher_function(..., filepath, ...) {
    # Construct the query
    get the greatest date from the file path
    construct query > created_date
    # Call the API function
    bugzilla_api_call_function(..., query, ...)
}

# API call function
bugzilla_api_call_function(..., query, ...) {
    if (query) {
        append query to API call
    }
    call API
}
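
The same layering as a runnable Python sketch, for illustration only. All names are hypothetical, including the `created_after` and `created_before` query keys, and the network call is replaced by an injected `send` callable so the structure can be exercised without a real API. Unlike the notes above, `refresh` here takes the latest saved date directly instead of reading it from a file path.

```python
from typing import Callable, Dict, Optional

# Hypothetical transport: receives query parameters, returns decoded results.
Send = Callable[[Dict[str, str]], list]


def api_call(send: Send, query: Optional[Dict[str, str]] = None) -> list:
    """The single place that talks to the API; the query is optional."""
    params: Dict[str, str] = {}
    if query:
        params.update(query)  # append the query only when one was given
    return send(params)


def build_date_query(lower: Optional[str] = None,
                     upper: Optional[str] = None) -> Dict[str, str]:
    """Build a date filter from whichever bounds are present."""
    query: Dict[str, str] = {}
    if lower is not None:
        query["created_after"] = lower    # hypothetical parameter name
    if upper is not None:
        query["created_before"] = upper   # hypothetical parameter name
    return query


def download_by_date(send: Send, lower: Optional[str] = None,
                     upper: Optional[str] = None) -> list:
    """Wrapper: construct the query from the bounds, then call the API."""
    return api_call(send, build_date_query(lower, upper))


def refresh(send: Send, latest_saved_date: str) -> list:
    """Wrapper: ask only for items created after the latest saved date."""
    return api_call(send, build_date_query(lower=latest_saved_date))
```

Both wrappers differ only in how they obtain the bounds, which is the point of the design: the API call itself exists once.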

carlosparadis commented 1 week ago

Closing this, since the downloaders ideated here are finally getting merged.

sailuh / kaiaulu

Downloader/Refresher Logic Overlap #290