Remove masterlist download altogether; just use date to download

linwoodc3 / gdeltPyR

Python-based framework to retrieve Global Database of Events, Language, and Tone (GDELT) version 1.0 and version 2.0 data.

https://linwoodc3.github.io/gdeltPyR/

GNU General Public License v3.0

197 stars 52 forks

Remove masterlist download altogether; just use date to download #4

Closed linwoodc3 closed 7 years ago

linwoodc3 commented 7 years ago

We lose a good 10-20 seconds downloading the masterlist. Instead, build the file names directly from date and time strings and pull the data based on the current time and date.
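A minimal sketch of the idea, not the code that ended up in gdeltPyR: it builds the export URLs for one day straight from timestamps, so no masterlist is needed. The base URL, the 15-minute file interval, and the function name `gdelt_v2_urls` are assumptions for illustration (the timestamp format matches the sample file name `20150218230000.export.CSV.zip` used later in this thread); check them against the GDELT documentation before relying on them.

```python
from datetime import datetime, timedelta, timezone

# Assumption: GDELT 2.0 event exports live under this base URL and are
# published on quarter-hour boundaries. Verify against the GDELT docs.
BASE_URL = "http://data.gdeltproject.org/gdeltv2/"


def gdelt_v2_urls(day, now=None):
    """Return export URLs for each 15-minute slot of `day` up to `now`.

    `day` is a date or datetime; `now` is a naive UTC datetime
    (defaults to the current UTC time). Hypothetical helper.
    """
    if now is None:
        now = datetime.now(timezone.utc).replace(tzinfo=None)
    start = datetime(day.year, day.month, day.day)
    urls = []
    for slot in range(96):  # 24 hours * 4 quarter-hour slots
        stamp = start + timedelta(minutes=15 * slot)
        if stamp > now:
            break  # do not request slots that lie in the future
        urls.append(f"{BASE_URL}{stamp:%Y%m%d%H%M%S}.export.CSV.zip")
    return urls
```

The most recent slot may not be published yet at the moment its timestamp passes, so a real caller would probably need to tolerate a missing file for the last URL.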

linwoodc3 commented 7 years ago

Skip the download-to-disk step in multiprocessing altogether. Rather, use the multiprocessing workers to read pandas dataframes directly from the URL list I build based on the dates passed. This should save a significant amount of time.
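A rough sketch of that approach, under stated assumptions: each worker in a `multiprocessing.Pool` calls `pd.read_csv` on one URL (or local path), and the parent concatenates the results. The function names `read_export` and `read_all` and the default of 4 processes are hypothetical and are not taken from gdeltPyR's code.

```python
from multiprocessing import Pool

import pandas as pd


def read_export(source):
    """Read one zipped, tab-delimited export (URL or local path) into a DataFrame."""
    # The export files carry no header row, hence header=None.
    return pd.read_csv(source, compression="zip", sep="\t", header=None)


def read_all(sources, processes=4):
    """Read every source in parallel and stack the results into one DataFrame."""
    with Pool(processes=processes) as pool:
        # pool.map preserves the order of `sources` in the returned list.
        frames = pool.map(read_export, sources)
    return pd.concat(frames, ignore_index=True)
```

Calling `read_all(urls)` with a list of export URLs would return a single dataframe. On platforms that start worker processes with spawn (Windows, recent macOS), that call needs to sit under an `if __name__ == "__main__":` guard.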

linwoodc3 commented 7 years ago

Sample

import pandas as pd
# The export file is zipped, tab-delimited, and has no header row.
df = pd.read_csv('/Users/linwood/Downloads/20150218230000.export.CSV.zip',
                 compression='zip', sep='\t', header=None)

linwoodc3 commented 7 years ago

Check whether blaze or dask can support this, as the way to handle the big data case.

linwoodc3 commented 7 years ago

Multiprocessing alone has sped this up by orders of magnitude; the concepts are in a notebook. Will push before the week is out and implement in the core code.

linwoodc3 commented 7 years ago

All features included in 787f2f74ea9290d201eb4166edd392ce52782669