Closed linwoodc3 closed 7 years ago
Skip the separate download step in the multiprocessing workers altogether. Instead, have the workers read pandas DataFrames directly from the URL list built from the dates passed in. This should save significant time.
Sample:

```python
import pandas as pd

df = pd.read_csv('/Users/linwood/Downloads/20150218230000.export.CSV.zip',
                 compression='zip', sep='\t', header=None)
```
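The read-directly-from-URL idea above could be sketched roughly as follows. This is a minimal sketch, not the project's actual implementation: `fetch_frames` is a hypothetical helper, and it uses a thread pool (downloads are I/O-bound, and `pd.read_csv` accepts URLs and local paths alike); swapping in `multiprocessing.Pool` for process-based workers is the same shape.

```python
from functools import partial
from multiprocessing.pool import ThreadPool  # or multiprocessing.Pool for processes

import pandas as pd


def fetch_frames(urls, workers=4):
    """Read each URL/path straight into a DataFrame in parallel, then concatenate.

    Hypothetical helper; `urls` would be the list built from the dates passed in.
    """
    # compression='infer' lets pandas detect .zip from the file extension
    reader = partial(pd.read_csv, sep='\t', header=None, compression='infer')
    with ThreadPool(workers) as pool:
        frames = pool.map(reader, urls)
    return pd.concat(frames, ignore_index=True)
```

Because the workers hand back DataFrames instead of files on disk, there is no intermediate write/read cycle to pay for.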
Check whether blaze or dask can support this to handle the big-data case.
Multiprocessing alone has sped this up by orders of magnitude; the concepts are in the notebook. Push before the week is out and implement in the core code.
All features are included in 787f2f74ea9290d201eb4166edd392ce52782669.
We lose a good 10–20 seconds downloading the master file list. Instead, use date and time format strings to build the file URL directly from the current date and time.
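Building the URL from the clock might look like the sketch below. Assumptions are flagged: the base URL and the `.export.CSV.zip` naming pattern are inferred from the sample filename above (`20150218230000.export.CSV.zip`, a 15-minute GDELT 2.0 timestamp), and `latest_export_url` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Assumed GDELT 2.0 base URL, inferred from the sample filename pattern
BASE = "http://data.gdeltproject.org/gdeltv2/"


def latest_export_url(now=None):
    """Build the export-file URL for the most recent 15-minute window,
    skipping the master-list download entirely. Hypothetical helper."""
    now = now or datetime.now(timezone.utc)
    # Round down to the previous 15-minute boundary, since files are
    # published on quarter-hour timestamps.
    floored = now.replace(minute=now.minute - now.minute % 15,
                          second=0, microsecond=0)
    return BASE + floored.strftime("%Y%m%d%H%M%S") + ".export.CSV.zip"
```

The trade-off versus the master list is that a file for the computed timestamp may not be published yet, so a fallback to the previous window on a 404 would still be needed.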