linwoodc3 / gdeltPyR

Python-based framework to retrieve Global Database of Events, Language, and Tone (GDELT) version 1.0 and version 2.0 data.
https://linwoodc3.github.io/gdeltPyR/
GNU General Public License v3.0

NameError: global name 'p' is not defined #12

Closed terry2tan closed 7 years ago

terry2tan commented 8 years ago

Traceback (most recent call last):
  File "D:\XXX\coding\gdelt\gdeltPyR.py", line 12, in
    results = gd.Search(['2016 10 19','2016 10 22'],table='events',coverage=True)
  File "C:\Users\XXX\Anaconda2\lib\site-packages\gdelt\base.py", line 290, in Search
    p
NameError: global name 'p' is not defined

linwoodc3 commented 8 years ago

@terry2tan, I fixed the bug in the master branch; if you give it a try and confirm it works, I'll make sure to update the distribution on PyPI.

To install from pip:

pip install git+https://github.com/linwoodc3/gdeltPyR

Then let me know if the same code above works. I hit the same error you did, installed the bug-fixed version of gdelt, and it worked; I want to make sure you get the same result.

terry2tan commented 8 years ago

thanks @linwoodc3, I reinstalled the way you showed me. My OS is Windows 7 64-bit. I tested it again and it shows the error below:

    cmd = get_command_line() + [rhandle]
  File "C:\Users\XXXX\Anaconda2\lib\multiprocessing\forking.py", line 358, in get_command_line
    is not going to be frozen to produce a Windows executable.''')
RuntimeError:
    Attempt to start a new process before the current process
    has finished its bootstrapping phase.

        This probably means that you are on Windows and you have
        forgotten to use the proper idiom in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce a Windows executable.
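The idiom the traceback refers to can be sketched generically like this (a minimal illustrative example, not gdeltPyR's actual fix). On Windows, multiprocessing spawns workers by re-importing the main module, so any code that starts processes must sit under the `__main__` guard; without it, each spawned child tries to start its own pool and raises the bootstrapping RuntimeError above:

```python
import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    # only needed if the script will be frozen into a Windows executable
    multiprocessing.freeze_support()
    # starting the pool inside the guard prevents spawned children
    # from re-running this block when they re-import the module
    pool = multiprocessing.Pool(2)
    print(pool.map(square, [1, 2, 3]))  # prints [1, 4, 9]
    pool.close()
    pool.join()
```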
linwoodc3 commented 8 years ago

Ah; Windows! Let me research it; I'm not good with Windows, so I'll look for help resolving the problem. I did a test: the single-interval pull works on Windows, but the multi-interval or multi-day pull fails because multiprocessing works differently on Windows machines.

I posted a question on Stack Overflow asking for help fixing this:

http://stackoverflow.com/questions/40487730/need-help-doing-multiprocessing-on-windows-for-my-pypi-package

linwoodc3 commented 7 years ago

@terry2tan, I pushed a new version and tested it on a Windows machine before the push; it should work now. Let me know if you can run the following without errors:

import gdelt

# Version 2 queries
gd2 = gdelt.gdelt(version=2)

# Full day pull, output to pandas dataframe, events table
results = gd2.Search(['2016 11 01'], table='events', coverage=True)
print(len(results))
linwoodc3 commented 7 years ago

@terry2tan , I've done some tests and it works. Closing this unless you're still having issues. Attached a screenshot of gdeltPyR's version 2 query working on a Windows machine.

If you (or anyone on Windows) have issues, feel free to add a comment.

[Screenshot, 2016-11-24: gdeltPyR version 2 query running on a Windows machine]

manojgali commented 6 years ago

Hi, how many months of data can I retrieve with results = gd.Search(['2016 10 19','2016 10 22'])? If there is a better method to retrieve at least three years of data, please share it.

linwoodc3 commented 6 years ago

@manojgali, the best method is the Search method, but three years is a LOT of data and could take a long time to pull; possibly terabytes in size and many hours when it's all done. You can speed the query up with multiprocessing.

GDELT version 2 only covers February 2015 to the present. So, if you want to query days earlier than February 2015, you need to use GDELT version 1. You set that up like this:

import gdelt

gd = gdelt.gdelt(version=1)

Now, since this will be such a big query, I advise iterating through the dates: run the query for a single day, write that day to disk, and proceed to the next day. But if you want to do one big pull covering the whole date range, here is a suggestion. Install futures if you are on Python 2 (concurrent.futures is in the standard library for Python 3). Then this is the best way to pull three years of data:


from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltPyR for version 1
gd = gdelt.gdelt(version=1)

# generic function to pull one day of data and write it to disk
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except Exception:
        # skip days that fail to download rather than aborting the whole pull
        pass

# guard the pool for Windows (see the multiprocessing error earlier in this thread)
if __name__ == '__main__':
    e = ProcessPoolExecutor()
    # now pull the data; this will take a long time
    results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))

The code above writes one file to disk for each day of GDELT data. If you decide to use version=2, you will need to add the coverage=True flag like so:


from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltPyR for version 2
gd = gdelt.gdelt(version=2)

# generic function to pull one day of data and write it to disk
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date, coverage=True)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except Exception:
        # skip days that fail to download rather than aborting the whole pull
        pass

# guard the pool for Windows (see the multiprocessing error earlier in this thread)
if __name__ == '__main__':
    e = ProcessPoolExecutor()
    # now pull the data; this will take a long time
    results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))

If you want to read all the data back for processing, install the dask library, which handles datasets larger than memory. To read everything you downloaded:

import dask.dataframe as dd

# read all the gdelt csvs into one dataframe
df = dd.read_csv('*_gdeltdata.csv')

Now you can run pandas-like operations using dask; see the dask documentation for more information. Outside of pulling the data, though, anything else you need help with is beyond the scope of the gdeltPyR library; it is just used to access data.