kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.4k stars 627 forks

Duplicate post text #293

Closed masalha-alaa closed 3 years ago

masalha-alaa commented 3 years ago

For some reason I'm getting the post text doubled. For example:

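from facebook_scraper import get_posts

for post in get_posts('verge', pages=3, extra_info=True, timeout=60, options={'posts_per_page': 200}):
    print(post['text'])  # post['post_text'] shows the same duplication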

post['text']: The e-commerce giant’s biggest entertainment acquisition yet

The e-commerce giant’s biggest entertainment acquisition yet

THEVERGE.COM Amazon buys MGM for $8.45 billion The studio behind the James Bond and Rocky franchises.

THEVERGE.COM Amazon buys MGM for $8.45 billion

post['post_text']: The e-commerce giant’s biggest entertainment acquisition yet

The e-commerce giant’s biggest entertainment acquisition yet

THEVERGE.COM Amazon buys MGM for $8.45 billion The studio behind the James Bond and Rocky franchises.
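Until the parser itself is fixed, a caller-side workaround might look like the sketch below. It is not part of facebook-scraper: drop_repeated_paragraphs is a hypothetical helper, and it assumes the repeated paragraphs are separated by newlines and are exact copies of each other.

```python
def drop_repeated_paragraphs(text):
    """Remove paragraphs that exactly repeat an earlier one, keeping order.

    Hypothetical helper, not a facebook-scraper function. Assumes paragraphs
    are separated by newlines; blank lines are discarded.
    """
    seen = set()
    kept = []
    for paragraph in text.split("\n"):
        key = paragraph.strip()
        if not key or key in seen:
            continue  # skip blank lines and exact repeats
        seen.add(key)
        kept.append(key)
    return "\n".join(kept)


sample = (
    "The e-commerce giant’s biggest entertainment acquisition yet\n"
    "\n"
    "The e-commerce giant’s biggest entertainment acquisition yet\n"
    "\n"
    "THEVERGE.COM Amazon buys MGM for $8.45 billion"
)
cleaned = drop_repeated_paragraphs(sample)
print(cleaned)
```

This only removes exact repeats, so a link preview line that is merely a prefix of another line is left untouched.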

mennatallah644 commented 3 years ago

I am facing the same problem

masalha-alaa commented 3 years ago

I examined the raw HTML response and found that the text appears only once:

<div><span><p>A big change for the series</p></span></div>

But when the parser's extract_text() calls element.find('p, header, span[role=presentation], .story_body_container>div'), it finds two occurrences:

Run [(node.tag, node.text) for node in element.find('p, header, span[role=presentation], .story_body_container>div')] to produce:

[('header', 'The Verge\n18 mins ·'), ('div', 'A big change for the series'), ('p', 'A big change for the series'), ('div', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022\nOld time Pokémon.'), ('header', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022')]

If we remove either p or .story_body_container>div from the selector, the duplicate disappears:

Remove .story_body_container>div: [(node.tag, node.text) for node in element.find('p, header, span[role=presentation]')]

[('header', 'The Verge\n18 mins ·'), ('p', 'A big change for the series'), ('header', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022')]

Remove p: [(node.tag, node.text) for node in element.find('header, span[role=presentation], .story_body_container>div')]

[('header', 'The Verge\n18 mins ·'), ('div', 'A big change for the series'), ('div', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022\nOld time Pokémon.'), ('header', 'THEVERGE.COM\nOpen-world RPG Pokémon Legends: Arceus is launching January 28th, 2022')]

However, given how extract_text() is implemented, the second variant is not a good fit: it keeps appending text until it reaches the second 'header', so the link preview div would end up in the post text as well.
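The overlap can be reproduced without the scraper. The sketch below uses only the standard library and emulates the 'p, header, .story_body_container>div' union by hand on a simplified fragment. The markup around the quoted <div><span><p> snippet is assumed, and match_union and drop_nested are hypothetical helpers. It shows one possible way to deduplicate, not necessarily what extract_text() or any later fix does.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the post markup; the class name comes from the
# selector in this thread, the surrounding structure is an assumption.
FRAGMENT = """
<div class="story_body_container">
  <header>The Verge</header>
  <div><span><p>A big change for the series</p></span></div>
</div>
"""


def node_text(node):
    """Concatenate all text inside a node, including nested elements."""
    return "".join(node.itertext()).strip()


def match_union(root):
    """Emulate 'p, header, .story_body_container>div' in document order."""
    direct_divs = {id(child) for child in root if child.tag == "div"}
    return [
        node
        for node in root.iter()
        if node.tag in ("p", "header") or id(node) in direct_divs
    ]


def drop_nested(matches):
    """Keep only matches that are not descendants of another match."""
    nested = set()
    for outer in matches:
        for inner in outer.iter():
            if inner is not outer:
                nested.add(id(inner))
    return [m for m in matches if id(m) not in nested]


root = ET.fromstring(FRAGMENT.strip())
matches = match_union(root)
print([(m.tag, node_text(m)) for m in matches])               # div and p both carry the text
print([(m.tag, node_text(m)) for m in drop_nested(matches)])  # nested p is dropped
```

The first list contains the same text twice because the p sits inside a div that also matches. Keeping the outermost match is only one choice; keeping the innermost one instead would correspond to the variant above that removes .story_body_container>div.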

neon-ninja commented 3 years ago

Hi, this commit should fix the problem: https://github.com/kevinzg/facebook-scraper/commit/da3efa99a0cb6e490de82b1d379dd7bb61dcdcd5

masalha-alaa commented 3 years ago

Looking good now, thanks.