grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python
Apache License 2.0

<br> tags are mostly ignored #25

Closed psilva261 closed 11 years ago

psilva261 commented 11 years ago

Hi,

(Unfortunately) some sites rely on br tags for newlines, an example is:

http://allnewlyrics.com/only-one-lyrics-pj-morton-ft-stevie-wonder.html

The newlines are almost completely ignored...

Is it difficult to solve that? I already tried to understand the crawler. So far I've seen that text included inside br tags gets preserved. On the other hand, in goose/outputformatters.py, self.remove_fewwords_paragraphs(article) removes single <br> tags. Also, the order of the br nodes seems to be changed at an early stage of the process. I wonder at which point that happens...

Cheers, Philip

georgepsarakis commented 11 years ago

Wouldn't it be better/easier to perform a regex replace of <br> and <br /> with newlines?
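
A minimal sketch of the regex approach suggested here, using only the standard library. The names `BR_TAG` and `br_to_newlines` are hypothetical and not part of Goose, and where such a step would hook into the extraction pipeline is an assumption. A regex over raw HTML is fragile in general, and a later parsing or whitespace-cleaning stage might still collapse the inserted newlines.

```python
import re

# Matches <br>, <br/>, <br /> and variants carrying attributes, case-insensitively.
# The \b keeps tags such as <broken> from matching.
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def br_to_newlines(raw_html):
    """Return raw_html with every <br> variant replaced by a newline.

    Hypothetical helper for illustration only; not part of Goose.
    """
    return BR_TAG.sub("\n", raw_html)


sample = "first line<br>second line<BR />third line"
converted = br_to_newlines(sample)
print(converted)
```

The substitution would have to run on the raw HTML string before it is handed to the parser, since afterwards the br elements are nodes in the tree rather than text.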

grangier commented 11 years ago

There is a fork https://github.com/ChorHizzle/python-goose/commit/e6b41bc267efaa9c79a1a214278bc56a44deeb7b which seems to handle newlines. I can't commit it since most of the tests are broken once it is applied. At the moment I'm not sure how to handle this.

Patches are welcome

psilva261 commented 11 years ago

@georgepsarakis I think that's what ChorHizzle's code is mostly doing

@grangier I've submitted a patch. I hope it is fine; I mostly took ChorHizzle's code. There were only two minor fixes necessary, and I added an additional unit test for allnewlyrics.com.

adisomani commented 9 years ago

This still looks like an issue. I can see the code above in my goose installation, but the br tags still seem to be ignored with this URL: http://timesofindia.indiatimes.com/tech/tech-news/Airtel-to-woo-data-users-through-free-internet/articleshow/45063462.cms? For example:

url = "http://timesofindia.indiatimes.com/tech/tech-news/Airtel-to-woo-data-users-through-free-internet/articleshow/45063462.cms?"

article.cleaned_text[:400]
u"NEW DELHI: Bharti Airtel , the country's biggest telecom operator, is set to offer free internet to subscribers for trial as the company looks to convert more users into 'internet users' to push up revenues and profits.Data is gradually turning out