Closed psilva261 closed 11 years ago
Wouldn't it be better/easier to perform a regex replace of <br>, <br />
with newlines?
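A minimal sketch of that regex idea, using only the standard library. The helper name `br_to_newlines` and the sample string are made up for illustration and are not part of goose. It assumes the replacement runs on the raw HTML string before parsing; whether the inserted newlines survive goose's later cleaning steps is not something this sketch addresses.

```python
import re

# Matches <br>, <br/>, <br /> and variants with attributes, case-insensitively.
# The \b keeps it from matching other tags that merely start with "br".
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def br_to_newlines(raw_html):
    """Return raw_html with every <br> variant replaced by a newline."""
    return BR_TAG.sub("\n", raw_html)


# Hypothetical sample input, not taken from any of the sites mentioned here.
sample = "line one<br>line two<br />line three<BR/>line four"
print(br_to_newlines(sample))
```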
There is a fork, https://github.com/ChorHizzle/python-goose/commit/e6b41bc267efaa9c79a1a214278bc56a44deeb7b, which seems to handle newlines. I can't commit it since most of the tests break once it is applied. At the moment I'm not sure how to handle this.
Patches are welcome
@georgepsarakis I think that's what ChorHizzle's code is mostly doing
@grangier I've submitted a patch. I hope it is fine; I mostly took ChorHizzle's code. Only two minor fixes were necessary, and I added an additional unit test for allnewlyrics.com.
This still looks like an issue. I can see the code above in my goose installation. But with this URL: http://timesofindia.indiatimes.com/tech/tech-news/Airtel-to-woo-data-users-through-free-internet/articleshow/45063462.cms? The br tags still seem to be ignored, e.g.
article.cleaned_text[:400] u"NEW DELHI: Bharti Airtel , the country's biggest telecom operator, is set to offer free internet to subscribers for trial as the company looks to convert more users into 'internet users' to push up revenues and profits.Data is gradually turning out
Hi,
(Unfortunately) some sites rely on br tags for newlines, an example is:
http://allnewlyrics.com/only-one-lyrics-pj-morton-ft-stevie-wonder.html
The newlines are almost completely ignored...
Is that difficult to solve? I have already tried to understand the crawler. So far I've seen that text included inside br tags gets preserved. On the other hand, in goose/outputformatters.py, self.remove_fewwords_paragraphs(article) removes single <br> tags. Also, the order of the br nodes seems to be changed at an early stage of the process. I wonder at which point that happens...
Cheers, Philip
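
For anyone looking at the tree level rather than the raw HTML: in ElementTree-style trees, the text that follows a br element is stored in that element's `tail`, so removing the node without first moving its tail discards that text. Below is a rough sketch of replacing br nodes with newlines while keeping the tail. It uses the standard library's `xml.etree.ElementTree` on a well-formed fragment as a stand-in; the function name `replace_br_with_newline` is hypothetical, and this is not the code goose or the fork uses.

```python
import xml.etree.ElementTree as ET


def replace_br_with_newline(root):
    """Remove <br> children in place, folding a newline plus the br's tail
    into the preceding sibling's tail or the parent's text."""
    # Snapshot the elements first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag != "br":
                previous = child
                continue
            text = "\n" + (child.tail or "")
            if previous is None:
                # No earlier sibling remains: append to the parent's leading text.
                parent.text = (parent.text or "") + text
            else:
                # Otherwise append to the tail of the last kept sibling.
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root


# Hypothetical fragment, written as well-formed XML so the stdlib parser accepts it.
snippet = ET.fromstring("<p>first line<br/>second <b>bold</b><br/>third line</p>")
replace_br_with_newline(snippet)
print("".join(snippet.itertext()))
```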