openeventdata / phoenix_pipeline

Turning news into events since 2014.
MIT License
50 stars 33 forks

Need to check for datelines by looking for mdash in Mongo.formatter.py #3

Closed philip-schrodt closed 10 years ago

philip-schrodt commented 10 years ago

Mongo.formatter.py: The presence of an "m-dash" (Unicode E2) in the first 32 or so characters of a story (particularly if the initial word is all-caps) usually signals a dateline, e.g.

"TAIPEI, Taiwan — "

and the text from the beginning of the story to the m-dash location + 1 could be eliminated. However, I haven't quite figured out the correct incantations to keep Python happy with such a check, though u"\xe2" is probably the way to designate the character.
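
The check described above could be sketched roughly as follows. This is only an illustration written for Python 3, not the code that ended up in the formatter; the function name `strip_dateline`, the `window` and `require_caps` parameters, and the choice to accept both an em dash and an en dash are all assumptions made for the example.

```python
# Hypothetical helper, not taken from the phoenix_pipeline source.
EM_DASH = "\u2014"  # UTF-8 bytes E2 80 94
EN_DASH = "\u2013"  # UTF-8 bytes E2 80 93


def strip_dateline(text, window=32, require_caps=False):
    """Drop a leading dateline such as 'TAIPEI, Taiwan <dash> '.

    A dash found within the first `window` characters is treated as the
    end of a dateline; everything up to and including it is removed.
    """
    head = text[:window]
    # Positions of each dash type inside the window (-1 means not found).
    hits = [head.find(dash) for dash in (EM_DASH, EN_DASH)]
    hits = [pos for pos in hits if pos != -1]
    if not hits:
        return text

    if require_caps:
        # Optionally insist that the first word is all-caps, e.g. "TAIPEI,".
        words = text.split(None, 1)
        first = words[0].strip(",.;:") if words else ""
        if not first.isupper():
            return text

    cut = min(hits)
    # Keep everything after the dash, minus the whitespace that follows it.
    return text[cut + 1:].lstrip()


if __name__ == "__main__":
    print(strip_dateline("TAIPEI, Taiwan \u2014 The government said on Monday."))
```

Working on decoded text rather than raw bytes sidesteps the question of how to spell the dash as a byte sequence; on Python 2 byte strings the comparison would have to be made against the encoded form instead.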

myi100 commented 10 years ago

"\xe2\x80\x93" is what we were looking for. Updated mongo.formatter.py.

johnb30 commented 10 years ago

Closed by #20.
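
For anyone landing here later, the byte sequences mentioned in this thread can be checked with a short Python 3 snippet. It is a sketch for illustration only: the `describe` helper is a made-up name, and nothing here is taken from the fix in #20. It shows that "\xe2\x80\x93" is the UTF-8 encoding of U+2013 (en dash), that the em dash U+2014 encodes as "\xe2\x80\x94", and that u"\xe2" on its own is a different character entirely.

```python
import unicodedata


def describe(char):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))


# The byte string quoted in the thread, decoded as UTF-8.
print(describe(b"\xe2\x80\x93".decode("utf-8")))

# The neighbouring byte sequence, which is the em dash proper.
print(describe(b"\xe2\x80\x94".decode("utf-8")))

# u"\xe2" is a single code point, not the first byte of a dash.
print(describe("\xe2"))
```

Since news sources use both dashes in datelines, a check that only matches one of the two byte sequences may miss some stories.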