Project OCEAN is an open science collaboration focused on understanding open source ecosystems, creating datasets that enable research, and forming a clear understanding of the state of open source communities.
Fix errors that prevented the all-mailing-list load and the partial date load from running.
One issue from the previous PR was that "-" was missing on a couple of parameters in build
Fixed how dates are set so that it stays in the current start month if the timespan is large enough
Fixed related tests
Added comments to help clarify what the code is doing as well as to
Fixed the Google Groups load so that it doesn't look at all the emails when we are not loading all dates. It's a hacky solution that needs to be revisited because it takes only the first 15% of the total pages available. This assumes that if you are not loading all dates, you are loading a more recent range of dates; if you enter a start and end that are further back in history, it is not set up to catch this appropriately.
Added a default workerNum; otherwise it freezes
There is still an error: connection reset by peer. I think it happens because there are too many requests to the Google Groups server, and a lower workerNum will potentially help.
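
For reviewers, here is a minimal sketch of the default-workerNum idea, written in Python for illustration only; it is not the code in this PR. The names resolve_worker_num and fetch_all, and the default value of 4, are hypothetical. The point is that a missing or non-positive worker count falls back to a default, and that the pool size caps how many requests hit the server at once.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical default; the PR description does not state the actual value.
DEFAULT_WORKER_NUM = 4


def resolve_worker_num(requested=None, default=DEFAULT_WORKER_NUM):
    """Return a usable worker count, falling back to the default
    when none is given or the given value is not positive."""
    if requested is None or requested < 1:
        return default
    return requested


def fetch_all(pages, fetch_page, worker_num=None):
    """Fetch every page with a bounded thread pool.

    fetch_page is a caller-supplied function (an assumption here);
    the pool size limits the number of concurrent requests.
    """
    workers = resolve_worker_num(worker_num)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(fetch_page, pages))
```

If the connection reset by peer error really is caused by request volume, passing a smaller worker_num would be the first thing to try under this sketch.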
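
The 15% page limit in the Google Groups load can be sketched roughly as follows, again in Python for illustration only and not as the code in this PR. The function name pages_to_scan, the all_dates flag, and the choice to round up are assumptions. The sketch also makes the known limitation visible: it only bounds how many pages are read from the front and never checks whether the requested start and end dates actually fall inside those pages.

```python
# The 15% figure comes from the description above; everything else is assumed.
RECENT_PAGE_PERCENT = 15


def pages_to_scan(total_pages, all_dates, percent=RECENT_PAGE_PERCENT):
    """Return how many pages to read, counting from the first page.

    When all dates are requested, every page is read. Otherwise only
    the first `percent` of pages is read (rounded up, at least one),
    on the assumption that the requested range is recent.
    """
    if total_pages <= 0:
        return 0
    if all_dates:
        return total_pages
    # Integer ceiling division avoids floating-point rounding surprises.
    limited = (total_pages * percent + 99) // 100
    return min(total_pages, max(1, limited))
```

A more robust version would stop based on the dates of the messages seen so far rather than on a fixed share of pages, which is the part that still needs to be revisited.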