Closed: terry2tan closed this issue 7 years ago.
@terry2tan, I fixed the bug on master; if you give it a try and confirm it works, I'll update the distribution on PyPI.
To install from pip:
pip install git+https://github.com/linwoodc3/gdeltPyR
Then let me know if the same code above works. I had the same error as you, installed the bug-fix version of gdelt, and it worked; I want to make sure you get the same results.
Thanks @linwoodc3, I reinstalled the way you showed me. My OS is Win7 64-bit. I tested it again, and it shows the error below:
cmd = get_command_line() + [rhandle]
  File "C:\Users\XXXX\Anaconda2\lib\multiprocessing\forking.py", line 358, in get_command_line
RuntimeError:
    Attempt to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are on Windows and you have
    forgotten to use the proper idiom in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce a Windows executable.
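For reference, the guard this error message describes looks like the following in practice; this is a minimal, self-contained sketch with a toy worker function, not gdeltPyR itself:

```python
from multiprocessing import Pool, freeze_support

def square(x):
    # toy worker function; on Windows each child process re-imports
    # this module, so the pool must only be created under the guard
    return x * x

if __name__ == '__main__':
    # freeze_support() only matters for frozen Windows executables,
    # but the __main__ guard itself is mandatory on Windows because
    # child processes are spawned (re-imported) rather than forked
    freeze_support()
    pool = Pool(2)
    print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
    pool.close()
    pool.join()
```

Without the guard, each spawned child re-executes the module's top level, tries to start its own pool, and raises exactly the RuntimeError shown above.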
Ah; Windows! Let me research it; I'm not great with Windows, so I'll look for some help resolving the problem. I did a test: a single-interval pull works on Windows, but a multi-interval or multi-day pull fails because multiprocessing works differently on Windows machines.
I am posting a question on Stack Overflow for some help fixing this. I don't work with Windows, so this is a problem for me.
@terry2tan, I did a new push and tested on a Windows machine before pushing; it should work now. Let me know if you can run the following without errors:
import gdelt

# Version 2 queries
gd2 = gdelt.gdelt(version=2)

# Full day pull, output to pandas dataframe, events table
results = gd2.Search(['2016 11 01'], table='events', coverage=True)
print(len(results))
@terry2tan, I've done some tests and it works. Closing this unless you're still having issues. I attached a screenshot of gdeltPyR's version 2 query working on a Windows machine.
If you (or anyone else on Windows) run into issues, feel free to add a comment.
Hi, how many months of data can I retrieve from results = gd.Search(['2016 10 19','2016 10 22']) by calling the Search function? If there is a better method to retrieve at least three years of data, please share it with me.
@manojgali. The best method is the Search method, but this will be a LOT of data and could take a long time to pull; possibly terabytes in size and hours of runtime by the time it's all done. You can speed the query up with multiprocessing.
GDELT version 2 only covers February 2015 to the present. So, if you want to query days earlier than February 2015, you need to use GDELT version 1. You set that up like this:
import gdelt
gd = gdelt.gdelt(version=1)
Now, since this will be such a big query, I advise iterating through the dates: run the query for a single day, write that day to disk, and move on to the next day. But if you want to do one big pull covering the whole date range, here is my suggestion. Install the futures backport if you are on Python 2 (concurrent.futures ships with Python 3 by default). Then, this is the best way to pull three years of data:
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltPyR for version 1
gd = gdelt.gdelt(version=1)

# generic function to pull and write data to disk based on date
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except Exception:
        pass

# now pull the data; this will take a long time
# (the __main__ guard keeps multiprocessing safe on Windows)
if __name__ == '__main__':
    e = ProcessPoolExecutor()
    results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))
That code above will write a file to your disk for each day of GDELT data. If you decide to use version=2, you will need to add the coverage=True flag like so:
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltPyR for version 2
gd = gdelt.gdelt(version=2)

# generic function to pull and write data to disk based on date
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date, coverage=True)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except Exception:
        pass

# now pull the data; this will take a long time
# (the __main__ guard keeps multiprocessing safe on Windows)
if __name__ == '__main__':
    e = ProcessPoolExecutor()
    results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))
If you want to read all the data back for processing, you should install the dask library to handle large datasets. To read all the data you downloaded:
import dask.dataframe as dd
# read all the gdelt csvs into one dataframe
df = dd.read_csv('*_gdeltdata.csv')
Now you can run pandas-like operations using dask. See the dask documentation for more information. Outside of pulling the data, though, anything else you need help with is outside the scope of the gdeltPyR library; it is just used to access data.
Traceback (most recent call last):
  File "D:\XXX\coding\gdelt\gdeltPyR.py", line 12, in
    results = gd.Search(['2016 10 19','2016 10 22'], table='events', coverage=True)
  File "C:\Users\XXX\Anaconda2\lib\site-packages\gdelt\base.py", line 290, in Search
    p
NameError: global name 'p' is not defined