semantic-systems / coypu-feeds

Apache License 2.0

Add Event-Codes from the CAMEO event taxonomy to output #3

Closed as311-ops closed 1 year ago

as311-ops commented 1 year ago

The GDELT 1.0 and 2.0 Event Databases use the CAMEO event taxonomy, which is a collection of more than 300 types of events organized into a hierarchical taxonomy and recorded in the files as a numeric code. These tab-delimited lookup files contain the human-friendly textual labels for each of those codes to make it easier to work with the data for those who have not previously worked with CAMEO.

See https://www.gdeltproject.org/data.html#documentation

EventCodes: https://www.gdeltproject.org/data/lookups/CAMEO.eventcodes.txt

So, if available, add CAMEOEVENTCODE and EVENTDESCRIPTION to the output of the API.
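A minimal sketch of what this could look like, assuming the lookup file is a tab-delimited table with a header row naming the two columns CAMEOEVENTCODE and EVENTDESCRIPTION. The inline sample rows, the `load_event_codes` and `annotate` helpers, and the record shape are illustrative assumptions, not part of this repo:

```python
import csv
import io

# Illustrative sample in the assumed layout of CAMEO.eventcodes.txt
# (tab-delimited, one header row). Not a copy of the real file.
SAMPLE_LOOKUP = (
    "CAMEOEVENTCODE\tEVENTDESCRIPTION\n"
    "01\tMAKE PUBLIC STATEMENT\n"
    "010\tMake statement, not specified below\n"
    "02\tAPPEAL\n"
)


def load_event_codes(text):
    """Parse the lookup text into a {code: description} dict.

    Codes stay strings so that leading zeros ("01", "010") are preserved.
    """
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {
        row["CAMEOEVENTCODE"].strip(): row["EVENTDESCRIPTION"].strip()
        for row in reader
    }


def annotate(record, code, labels):
    """Return a copy of `record` with the two fields added, if available.

    `record` is a hypothetical output item; when no code is known for it,
    the record is returned unchanged.
    """
    if code is None or code not in labels:
        return dict(record)
    return {**record, "CAMEOEVENTCODE": code, "EVENTDESCRIPTION": labels[code]}


labels = load_event_codes(SAMPLE_LOOKUP)
enriched = annotate({"title": "Some headline"}, "02", labels)
```

The lookup only maps codes to labels; it does not say which code applies to a given document, so the `code` argument has to come from somewhere else.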

junbohuang commented 1 year ago

I don't think the output of the GDELT API 2.0 that we are using contains this information. The GDELT Event Database (available via Google BigQuery or as a local download) does include it, produced by an automatic parser whose details are not specified. Getting it from there would be quite difficult, since the DB is massive (one year alone is 2.5 TB). We would need someone, or a server, to solve the storage/loading problem.

That being said, I do not think we can have this information in this repo. What do you think?

GDELT API 2.0: https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/ and https://github.com/alex9smith/gdelt-doc-api

junbohuang commented 1 year ago

An idea ✨: since there is no event action detection dataset, perhaps we can take a subset of the GDELT Event Database and create a dataset from it.

Available information in GDELT Event Database: event types, actors, temporal and geo-information and a list of articles describing each event.
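One way such a subset could be drawn, as a rough sketch only: group events by their two-digit CAMEO root code and sample a fixed number per group so the resulting dataset is not dominated by the most frequent event types. The record fields (`event_code`, `actor1`, `date`, `source_url`), the example rows, and the `sample_per_root_code` helper are hypothetical and simplified; they are not the actual GDELT column names or data.

```python
import random
from collections import defaultdict


def sample_per_root_code(events, per_class, seed=0):
    """Pick up to `per_class` events for each two-digit CAMEO root code."""
    by_root = defaultdict(list)
    for event in events:
        # The first two digits of a CAMEO code identify its root category.
        by_root[event["event_code"][:2]].append(event)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for root in sorted(by_root):
        group = by_root[root]
        subset.extend(rng.sample(group, min(per_class, len(group))))
    return subset


# Hypothetical rows standing in for records pulled from the Event Database.
events = [
    {"event_code": "010", "actor1": "A", "date": "20230101", "source_url": "https://example.org/1"},
    {"event_code": "012", "actor1": "B", "date": "20230102", "source_url": "https://example.org/2"},
    {"event_code": "013", "actor1": "C", "date": "20230103", "source_url": "https://example.org/3"},
    {"event_code": "020", "actor1": "D", "date": "20230104", "source_url": "https://example.org/4"},
]

subset = sample_per_root_code(events, per_class=2)
```

Each sampled event would still need its article text fetched from the linked sources before it could serve as a labelled example.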

junbohuang commented 1 year ago

Closing this issue: the crawler outputs only raw text and meta-information about the documents (news), so no event type will be available.