r1b / news-wires

Wire services for everyone
https://news.r1b.solutions
BSD 2-Clause "Simplified" License

Improve stripping of source from parsed headline #4

Closed r1b closed 7 years ago

r1b commented 7 years ago

The problem:

Many headlines (esp. from title tags) have a prelude / postlude consisting of a delimiter and the name of the site, e.g.:

Man Dies in Bathtub - Associated Press
Reuters | Dormant Volcano Found in Bavarian Woman's Backyard

I have a heuristic that strips these out, but it is unreliable. Further, sometimes you actually want the text after the delimiter, e.g.:

"The U.S.A has a fake news problem" - Trump
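
To make the failure mode concrete, here is a minimal sketch of this kind of heuristic. It is not the actual news-wires code: the function name `naive_strip_source`, the delimiter set, and the "keep the longest segment" rule are all assumptions chosen for illustration.

```python
import re

# Assumed delimiters: a hyphen or pipe surrounded by whitespace.
# The real heuristic may recognise a different set.
DELIMITER = re.compile(r"\s+[-|]\s+")


def naive_strip_source(headline: str) -> str:
    """Split on a delimiter and keep the longest segment.

    Hypothetical heuristic: it assumes the site name is always
    shorter than the headline itself.
    """
    parts = DELIMITER.split(headline.strip())
    return max(parts, key=len)


# Works when the short segment really is the site name.
print(naive_strip_source("Man Dies in Bathtub - Associated Press"))
print(naive_strip_source(
    "Reuters | Dormant Volcano Found in Bavarian Woman's Backyard"))

# Fails when the short segment is an attribution we want to keep:
# the speaker's name is silently dropped.
print(naive_strip_source('"The U.S.A has a fake news problem" - Trump'))
```

The last call returns only the quoted sentence, so the attribution is lost. Nothing in the string itself distinguishes a site name from a speaker, which is why a purely textual rule like this stays unreliable.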

r1b commented 7 years ago

For now I am going to remove the code that does this. The only thing I will do to the headline is remove leading / trailing whitespace.

r1b commented 7 years ago

I will also add headline selectors for all sources - should help mitigate the issue.
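
A rough sketch of what per-source headline selectors could look like, using only the Python standard library. Everything here is an assumption for illustration: the source names, the `HEADLINE_SELECTORS` table, the simplified `(tag, class)` selector form, and the `extract_headline` helper are hypothetical and not taken from the news-wires code.

```python
from html.parser import HTMLParser

# Hypothetical per-source selectors as (tag, class) pairs.
# A class of None matches any element with that tag.
HEADLINE_SELECTORS = {
    "example-wire": ("h1", "headline"),
    "other-wire": ("h2", None),
}


class _FirstMatchText(HTMLParser):
    """Collect the text of the first element matching (tag, class)."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # True once the first match has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements with the same tag name.
            if tag == self.tag:
                self.depth += 1
            return
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.chunks.append(data)


def extract_headline(source, html):
    """Use the source's selector, falling back to the <title> tag.

    The only cleanup applied is stripping leading / trailing whitespace.
    """
    tag, css_class = HEADLINE_SELECTORS.get(source, ("title", None))
    parser = _FirstMatchText(tag, css_class)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None
```

With a selector in place, the headline comes from the element the site uses for it, so the site name in the title tag never enters the picture. Sources without a selector still fall back to the raw title, prelude / postlude included.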