Open ahalterman opened 8 years ago
The downside is the overhead when you have a large corpus that is mostly junk (e.g. I recently worked with one generated by the data-provider-who-shall-not-be-named with 2.5M stories, only about 2% of which generated events). We could do this just as easily by doing the substitutions in a post-processing phase, right?
I had forgotten that the geocoding step in postprocess.py hits the db again. Since that's also where it does the actor splitting, it would be easy to put a step in there where it geocodes the full article and updates the db, but only if an event has already been coded from that article.
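A rough sketch of that idea, for illustration only: geocode the full article text, but only for articles that already produced an event. Everything here is hypothetical (the `articles` record layout, `coded_ids`, `toy_geocode_country`, and `geocode_coded_articles` are made-up names, and the keyword lookup is a stand-in for a real geocoder); it is not the actual postprocess.py code or db schema.

```python
from typing import Callable, Dict, Iterable, Optional, Set

# Hypothetical stand-in for a real geocoder: naive place-name lookup.
TOY_GAZETTEER = {"Damascus": "SYR", "Aleppo": "SYR", "Nairobi": "KEN"}


def toy_geocode_country(text: str) -> Optional[str]:
    """Return the ISO3 code of the most-mentioned country, or None."""
    counts: Dict[str, int] = {}
    for place, iso3 in TOY_GAZETTEER.items():
        n = text.count(place)
        if n:
            counts[iso3] = counts.get(iso3, 0) + n
    if not counts:
        return None
    return max(counts, key=counts.get)


def geocode_coded_articles(
    articles: Iterable[dict],
    coded_ids: Set[str],
    geocode: Callable[[str], Optional[str]] = toy_geocode_country,
) -> Dict[str, str]:
    """Map article id -> country, only for articles that yielded an event."""
    updates: Dict[str, str] = {}
    for article in articles:
        if article["id"] not in coded_ids:
            continue  # skip stories that never generated an event
        country = geocode(article["text"])
        if country is not None:
            updates[article["id"]] = country
    return updates


articles = [
    {"id": "a1", "text": "Clashes in Aleppo spread toward Damascus."},
    {"id": "a2", "text": "Celebrity gossip with no events, filed from Nairobi."},
    {"id": "a3", "text": "Talks were held at an undisclosed location."},
]
coded_ids = {"a1", "a3"}  # ids of articles with at least one coded event

updates = geocode_coded_articles(articles, coded_ids)
print(updates)
```

The returned dict would then be what gets written back to the db, so the geocoder only ever runs on the small fraction of stories that produced events.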
Think about adding a pre-pipeline coding step that geocodes complete articles (rather than sentences) to the country. This would be useful for two things:
Because the pipeline operates at the sentence level, the actual geocoding of articles would have to happen outside the pipeline. The only changes to the pipeline itself would be those needed to use the new info.