linwoodc3 / gdeltPyR

Python-based framework to retrieve Global Database of Events, Language, and Tone (GDELT) version 1.0 and version 2.0 data.
https://linwoodc3.github.io/gdeltPyR/
GNU General Public License v3.0
203 stars, 53 forks

ENH: Add support for distinction between native-english and translated-to-english datasets with GDELT #27

Closed: pietermarsman closed this 7 years ago

pietermarsman commented 7 years ago

Added a translation parameter to Search and used it in urlBuilder to get the paths to the translated files.

This fixes #26.
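
As a rough illustration of the idea, the sketch below shows how a translation flag could switch between the native-English and translated file paths. The function name build_url, the SUFFIXES table, and the ".translation." infix are assumptions made for this example; this is not the actual urlBuilder code in gdeltPyR.

```python
# Hypothetical sketch: build a GDELT 2.0 file URL for one 15-minute timestamp.
# The base URL matches the one quoted later in this thread; the suffixes and
# the "translation." infix are assumed naming conventions, not library code.
BASE_URL = "http://data.gdeltproject.org/gdeltv2/"

SUFFIXES = {
    "events": "export.CSV.zip",
    "mentions": "mentions.CSV.zip",
    "gkg": "gkg.csv.zip",
}


def build_url(timestamp, table="events", translation=False):
    """Return the URL for `table` at `timestamp` (a YYYYMMDDHHMMSS string)."""
    if table not in SUFFIXES:
        raise ValueError(f"unknown table: {table!r}")
    # Translated files are assumed to carry an extra "translation." segment.
    infix = "translation." if translation else ""
    return f"{BASE_URL}{timestamp}.{infix}{SUFFIXES[table]}"


print(build_url("20170708000000"))
print(build_url("20170708000000", translation=True))
```

Keeping the flag as a single boolean means the caller's query stays the same apart from translation=True, which is how the Search calls later in this thread are written.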

linwoodc3 commented 7 years ago

Thanks for the merge request. Reviewing this now.

linwoodc3 commented 7 years ago

I'm getting errors when trying to use this, @pietermarsman. The unit tests pass, but trying to query gives errors. There appears to be a difference in column counts: the English events table has 62 columns, while the translingual events table appears to have 61. I haven't checked the other tables (mentions, gkg) yet.

pietermarsman commented 7 years ago

What do you mean by:

"trying to query gives errors"

Maybe you can write a unittest that tests the wrong behavior?

pietermarsman commented 7 years ago

I am not able to replicate the different number of columns.

My output:

>>> gdelt.gdelt().Search("2017 07 08", translation=False).shape
(1281, 62)
>>> gdelt.gdelt().Search("2017 07 08", translation=True).shape
(576, 62)

linwoodc3 commented 7 years ago

@pietermarsman I figured it out. Your PR did not cause the problem; it came from the library overall and from the GDELT service. The main fix I had to make was adding an exception for queries that return zero data. For example, if you run this query:

checked = gd.Search('2017 Jul 27', translation=True)

it will reproduce the error I saw. This is more explanation than you probably care about, but the problem is that GDELT (the service) does not have a file at http://data.gdeltproject.org/gdeltv2/20170727234500.gkg.csv.zip, so the query returns zero data. It looks like GDELT (the service) went down or had an error on some days and failed to publish a news file.
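
The kind of guard described above could look roughly like the sketch below. It is a minimal illustration, not the fix that went into gdeltPyR: collect_rows and the fetch callable are hypothetical names, and fetch is assumed to return a list of rows, or None when the service has no file for a URL.

```python
# Hypothetical sketch: tolerate individual missing files, but raise a clear
# error when a query returns no data at all.
def collect_rows(urls, fetch):
    """Gather rows from every URL; return (rows, list_of_missing_urls)."""
    rows, missing = [], []
    for url in urls:
        result = fetch(url)
        if not result:
            # The service had no file (or an empty one) for this interval.
            missing.append(url)
            continue
        rows.extend(result)
    if not rows:
        raise ValueError(
            "Query returned zero data; none of the requested files exist."
        )
    return rows, missing


# Usage with a stand-in fetcher that knows about only one file.
fake_store = {"a.csv.zip": [["row1"], ["row2"]]}
rows, missing = collect_rows(["a.csv.zip", "b.csv.zip"], fake_store.get)
print(len(rows), missing)
```

Raising only when every file is missing lets a query spanning several intervals survive a single gap while still failing loudly when nothing comes back.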