Closed plammens closed 3 years ago
I saw that Article instances' raw_text seemed to be empty:
from pyrae import dle
search_result = dle.search_by_word("selección")
print(repr(search_result.articles[0]))
Article(id="XUE4F1v", lema="selección", raw_text=" ")
I investigated a bit, and the issue seems to be that Article._parse_html extracted raw_text as its last step: https://github.com/nachocho/pyrae/blob/4df5097e5769a9a47ddd4d3bff0120ccf0a1ad09/pyrae/core.py#L1116-L1143
Each PageElement.append call results in a call to PageElement.extract, so the appended element is removed from the self._soup object in place. By the end of processing the soup is empty, and get_text() returns only whitespace.
So I moved the get_text call to the top of the function.
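For reference, here is a minimal sketch of the behaviour outside of pyrae, assuming only that Beautiful Soup (bs4) is installed. The HTML snippet and the names container, text_before and text_after are made up for illustration and are not taken from core.py.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for an article page.
html = "<article>\n<p>primera</p>\n<p>segunda</p>\n</article>"
soup = BeautifulSoup(html, "html.parser")

# Capture the text first, while the tree is still intact.
text_before = soup.get_text()

# A fresh tag that is not attached to the soup.
container = soup.new_tag("div")

# Appending an element that already has a parent moves it:
# it is detached from its old tree before being attached to the new one.
for p in soup.find_all("p"):
    container.append(p)

# Only the whitespace between the moved elements is left in the soup.
text_after = soup.get_text()

print(repr(text_before))
print(repr(text_after))
print(repr(container.get_text()))
```

Reading the text before any element is moved is the same reordering as in this change. Copying the elements instead of moving them would presumably also work, but it seemed more invasive.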