grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python
Apache License 2.0

Portuguese stopword file #76

Closed by grangier 10 years ago

grangier commented 10 years ago

Text extraction in Portuguese doesn't work well due to a missing stopword file (see #60).
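
For context, here is a rough sketch of why a missing stopword file can hurt extraction. It assumes the extractor ranks candidate text by how many stopwords it contains, which is a common heuristic; this is not Goose's actual code. `PT_STOPWORDS` and `count_stopwords` are hypothetical names made up for the illustration, and the word list is only a tiny sample.

```python
# Hypothetical, tiny sample of Portuguese stopwords (assumption for illustration);
# a real stopword file would contain many more entries.
PT_STOPWORDS = {"não", "é", "para", "que", "do", "em", "sua", "e", "mas", "um"}


def count_stopwords(text, stopwords):
    """Count the whitespace-separated tokens of `text` that appear in `stopwords`."""
    # Strip surrounding punctuation and lowercase each token before the lookup.
    tokens = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return sum(1 for tok in tokens if tok in stopwords)


sample = "Não é novidade para ninguém que muitos objetos do nosso cotidiano"

# With a Portuguese list, the sentence looks like real prose to the scorer.
with_file = count_stopwords(sample, PT_STOPWORDS)

# With no list for the language, every paragraph scores zero and nothing stands out.
without_file = count_stopwords(sample, set())

print(with_file, without_file)
```

Under that assumption, an empty or missing list leaves every paragraph with the same score, so the extractor has nothing to distinguish article text from the rest of the page.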

grangier commented 10 years ago

>>> url = 'http://tecnoblog.net/146800/producao-dispositivos-moveis-afetada-substitutos-para-metais-raros/'
>>> from goose import Goose
>>> g = Goose({'use_meta_language': False, 'target_language':'pt'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
u'N\xe3o \xe9 novidade para ningu\xe9m que muitos objetos do nosso cotidiano t\xeam em sua composi\xe7\xe3o materiais que s\xe3o escassos e n\xe3o renov\xe1veis, mas um estudo con'

fjorgemota commented 10 years ago

Perfect! I will test soon =)