openeventdata / phoenix_pipeline

Turning news into events since 2014.
MIT License

Truncated URLs in mongo_formatter.py #12

Closed philip-schrodt closed 10 years ago

philip-schrodt commented 10 years ago

TABARI has a 255-char input line limit, and a few of the Google URLs exceed that length, so I truncated them in mongo_formatter.py. When we switch to the PETR input format we can eliminate this. It only applies to a very small number of cases, but that was enough to crash TABARI.
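
For illustration only, a minimal sketch of the kind of truncation described above. The function name `truncate_url` and the constant `TABARI_LINE_LIMIT` are hypothetical and are not taken from mongo_formatter.py. The sketch also assumes the limit is applied to the URL field by itself, whereas the real limit covers the whole input line.

```python
# Hypothetical names; not the actual code in mongo_formatter.py.
TABARI_LINE_LIMIT = 255  # input line limit mentioned in this issue


def truncate_url(url, max_len=TABARI_LINE_LIMIT):
    """Return (url, was_truncated), cutting url to at most max_len characters."""
    if len(url) <= max_len:
        return url, False
    return url[:max_len], True


if __name__ == "__main__":
    long_url = "http://example.com/" + "a" * 300
    short, cut = truncate_url(long_url)
    print(len(short), cut)
```

Returning a flag alongside the URL makes it easy to log how many records were affected, which would help confirm that only a small number of cases are involved.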

myi100 commented 10 years ago

Google news links are in the form: http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNEL7YZaEjLslibnY_bWBCXTG8ym3Q&url=http://www.torontosun.com/2014/02/03/broncos-canuck-feels-heat-from-media

with the source article after &url= in the link above. The way I have it set up now, every Google news link is split and only the source article link is kept. Is there any reason why we would want to keep the entire news.google link?
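
A rough sketch of the split described above, assuming Python 3's `urllib.parse` and a link whose query string starts with `?`. The function name `extract_source_url` is hypothetical and is not the code currently in the pipeline.

```python
from urllib.parse import urlsplit, parse_qs


def extract_source_url(link):
    """Return the source article URL from a Google news redirect link.

    Links that are not Google news redirects, or that carry no url=
    parameter, are returned unchanged.
    """
    parts = urlsplit(link)
    if not parts.netloc.endswith("news.google.com"):
        return link
    targets = parse_qs(parts.query).get("url")
    return targets[0] if targets else link


if __name__ == "__main__":
    google_link = (
        "http://news.google.com/news/url?sa=t&fd=R"
        "&usg=AFQjCNEL7YZaEjLslibnY_bWBCXTG8ym3Q"
        "&url=http://www.torontosun.com/2014/02/03/"
        "broncos-canuck-feels-heat-from-media"
    )
    print(extract_source_url(google_link))
```

One caveat with this approach: if the source article URL itself contains an unescaped `&`, everything after it is parsed as a separate parameter and dropped, so splitting the raw string on `&url=` may be the safer choice for such links.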

johnb30 commented 10 years ago

So all links are now under the 255-char input limit, then? Google news links were the only ones that went over?

johnb30 commented 10 years ago

Seems to be resolved now. Can reopen if need be.