philip-schrodt closed this 10 years ago
Google News links are in the form: http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNEL7YZaEjLslibnY_bWBCXTG8ym3Q&url=http://www.torontosun.com/2014/02/03/broncos-canuck-feels-heat-from-media
with the source article after &url= in the link above. The way I have it set up now, every Google News link is split at &url= and only the source article link is kept. Is there any reason why we would want to keep the entire news.google link?
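For reference, here is a minimal sketch of that split in Python. It is not the actual code in the pipeline: the function name `source_url` is made up for illustration, and it assumes the link is a well-formed URL whose query string carries the article in a `url` parameter. If the source article link itself contains an unescaped `&`, this approach would cut it short, so treat it as a starting point only.

```python
from urllib.parse import urlsplit, parse_qs


def source_url(link):
    """Return the source article URL from a Google News redirect link.

    Hypothetical helper: links that are not Google News redirects, or
    that have no url= parameter, are returned unchanged.
    """
    parts = urlsplit(link)
    if parts.netloc.endswith("news.google.com"):
        # parse_qs maps each query parameter to a list of its values
        target = parse_qs(parts.query).get("url")
        if target:
            return target[0]
    return link
```

Calling `source_url()` on a Google News redirect should give back just the article link, and any other link passes through untouched.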
So all links are now under the 255-char input limit, then? Google News links were the only ones that went over?
Seems to be resolved now. Can reopen if need be.
TABARI has a 255-char input line limit and a few of the Google URLs go beyond that length, so I truncated these in mongo_formatter.py. When we switch to the PETR input format we can eliminate this. It only applies to a very small number of cases but was enough to crash TAB.
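A rough sketch of the kind of truncation described, for anyone following along. This is not the code in mongo_formatter.py: the names `MAX_LINE` and `fit_to_limit` are hypothetical, and it assumes the limit is counted in characters and applies to the URL string on its own.

```python
MAX_LINE = 255  # TABARI input line limit mentioned above (assumed to be in characters)


def fit_to_limit(text, limit=MAX_LINE):
    """Cut text down to at most `limit` characters (hypothetical helper)."""
    # Slicing leaves strings that are already short enough unchanged
    return text[:limit]
```

A truncated URL will usually no longer resolve, so this only makes sense where the URL is kept as an identifier rather than followed.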