nachocho / pyrae

Perform searches against the RAE (Real Academia Española) dictionary.

fix: Extract article's raw text before extracting soup's elements #8

Closed plammens closed 3 years ago

plammens commented 3 years ago

I noticed that the raw_text of Article instances appears to be empty:

from pyrae import dle

search_result = dle.search_by_word("selección")
print(repr(search_result.articles[0]))
Article(id="XUE4F1v", lema="selección", raw_text="

")

I investigated a bit, and it looks like the issue is that Article._parse_html extracts raw_text as its last step: https://github.com/nachocho/pyrae/blob/4df5097e5769a9a47ddd4d3bff0120ccf0a1ad09/pyrae/core.py#L1116-L1143. Each of the PageElement.append calls results in a call to PageElement.extract, so the appended element is removed from the self._soup object in place. By the end of processing the soup is empty, and get_text() returns just a bunch of whitespace.

So I moved the get_text call to the top of the function.
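
For reference, here is a minimal sketch that reproduces the effect outside pyrae. It assumes only that Beautiful Soup 4 (bs4) is installed. The markup and the two helper functions are made up for illustration and are not pyrae code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a dictionary article.
HTML = "<article><p>first</p><p>second</p></article>"


def text_after_moving(markup):
    """Read the text last, after the elements have been appended elsewhere."""
    soup = BeautifulSoup(markup, "html.parser")
    target = BeautifulSoup("<div></div>", "html.parser").div
    for p in soup.find_all("p"):
        # Appending an element that already has a parent detaches it
        # from that parent first, so it leaves `soup`.
        target.append(p)
    return soup.get_text()


def text_before_moving(markup):
    """Read the text first, while the elements are still in the soup."""
    soup = BeautifulSoup(markup, "html.parser")
    raw_text = soup.get_text()
    target = BeautifulSoup("<div></div>", "html.parser").div
    for p in soup.find_all("p"):
        target.append(p)
    return raw_text


print(repr(text_after_moving(HTML)))
print(repr(text_before_moving(HTML)))
```

The first call reads from a soup whose paragraphs have already been moved out, while the second captures the text up front, which is the same reordering this PR applies to the get_text call.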