masalha-alaa closed this 3 years ago
I am facing the same problem
I examined the raw HTML response and found that the text appears only once:
<div><span><p>A big change for the series</p></span></div>
But when the parser's extract_text() calls element.find('p, header, span[role=presentation], .story_body_container>div'), two occurrences are found:
Running [(node.tag, node.text) for node in element.find('p, header, span[role=presentation], .story_body_container>div')] produces:
[('header', 'The Verge\n18 mins ·'), ('div', 'A big change for the series'), ('p', 'A big change for the series'), ('div', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022\nOld time Pokémon.'), ('header', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022')]
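The duplicate is consistent with the selector group matching both a wrapping div and the p nested inside it, so the same text is collected twice. Here is a minimal sketch of that overlap using only the standard library. It assumes the div from the response is a direct child of a .story_body_container element (that wrapper is my assumption, it is not shown in the snippet above), and ElementTree paths stand in for the two CSS selectors:

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: assumes the <div> from the raw response is a direct
# child of the .story_body_container element.
SNIPPET = (
    '<div class="story_body_container">'
    '<div><span><p>A big change for the series</p></span></div>'
    '</div>'
)


def matched_texts(markup):
    """Return (tag, text) for nodes matched by the container's child divs or by any p."""
    root = ET.fromstring(markup)
    # './div' mimics '.story_body_container>div'; './/p' mimics the bare 'p' selector.
    matches = root.findall('./div') + root.findall('.//p')
    # itertext() gathers the text of the node and all of its descendants.
    return [(node.tag, ''.join(node.itertext())) for node in matches]


pairs = matched_texts(SNIPPET)
print(pairs)
```

Both the div and the p report the same text, even though it occurs only once in the markup.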
If we remove either p or .story_body_container>div from the search pattern, the result looks good:
Remove .story_body_container>div:
[(node.tag, node.text) for node in element.find('p, header, span[role=presentation]')]
[('header', 'The Verge\n18 mins ·'), ('p', 'A big change for the series'), ('header', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022')]
Remove p:
[(node.tag, node.text) for node in element.find('header, span[role=presentation], .story_body_container>div')]
[('header', 'The Verge\n18 mins ·'), ('div', 'A big change for the series'), ('div', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022\nOld time Pokémon.'), ('header', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022')]
But given the way extract_text() is implemented, the second result is not usable, because it would keep appending text until it reached the second 'header'.
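Another option, if the selector has to stay as it is, would be to skip a node whose text repeats the one just before it. This is only a rough sketch working on the (tag, text) pairs printed above. The helper name dedupe_adjacent is made up for illustration, and this is not how extract_text() is actually implemented:

```python
def dedupe_adjacent(nodes):
    """Drop a (tag, text) pair when its text equals the previously kept text."""
    kept = []
    for tag, text in nodes:
        if kept and kept[-1][1] == text:
            # Same text as the last kept node (e.g. a div and its nested p): skip it.
            continue
        kept.append((tag, text))
    return kept


# The first three pairs from the output above.
nodes = [
    ('header', 'The Verge\n18 mins ·'),
    ('div', 'A big change for the series'),
    ('p', 'A big change for the series'),
]
result = dedupe_adjacent(nodes)
print(result)
```

Note that this only catches exact, adjacent repeats. It would not help when a parent node's text merely contains the child's text along with other content.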
Hi, this should fix the problem https://github.com/kevinzg/facebook-scraper/commit/da3efa99a0cb6e490de82b1d379dd7bb61dcdcd5
Looking good now, thanks.
For some reason I'm getting the post text doubled. For example:
post['text']: The e-commerce giant’s biggest entertainment acquisition yet
The e-commerce giant’s biggest entertainment acquisition yet
THEVERGE.COM Amazon buys MGM for $8.45 billion The studio behind the James Bond and Rocky franchises.
THEVERGE.COM Amazon buys MGM for $8.45 billion
post['post_text']: 'The e-commerce giant’s biggest entertainment acquisition yet
The e-commerce giant’s biggest entertainment acquisition yet
THEVERGE.COM Amazon buys MGM for $8.45 billion The studio behind the James Bond and Rocky franchises.'