rugantio / fbcrawl

A Facebook crawler
Apache License 2.0
661 stars, 229 forks

scraping everything but video posts #6

Closed slakat closed 5 years ago

slakat commented 5 years ago

I've been trying to find out why the script only extracts posts that are images or text (even 360º posts worked just fine). When a post is a video, it isn't downloaded along with the rest of the data. At first I thought the HTML was different, or that the attribute "top_level_post_id" didn't apply to video posts, but I looked into it in the mbasic FB feed and that doesn't seem to be the case.

Any idea? Thanks!

rugantio commented 5 years ago

Thank you for the heads up! There seems to be a problem with Scrapy regarding videos: the response you get back from the request is stripped of them. You can confirm that it's not behaving as it should by inserting this snippet immediately under def parse_page, before the for loop over the post elements:

        from scrapy.utils.response import open_in_browser
        open_in_browser(response)

This will open the crawled page exactly as Scrapy received it, before any parsing. On the pages I've tried, Scrapy seems to have skipped the video posts, which is why they don't appear in the final CSV. Can you confirm that you get the same behavior?
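If opening a browser is inconvenient (for example on a headless machine), a rough alternative is to write the raw body to disk and count how many post markers it contains. The sketch below uses only the standard library. The helper name `dump_body` is hypothetical, and using `top_level_post_id` as a once-per-post marker is an assumption based on the attribute mentioned above, not something the repo guarantees.

```python
import tempfile
from pathlib import Path


def dump_body(body: bytes, marker: bytes = b"top_level_post_id", directory=None):
    """Save a raw response body to disk and count marker occurrences.

    Returns a (path, count) tuple. `marker` is an assumption: any byte
    string that appears once per post in the page source would do.
    """
    # Use the given directory, or create a fresh temporary one.
    out_dir = Path(directory) if directory else Path(tempfile.mkdtemp())
    path = out_dir / "raw_response.html"
    path.write_bytes(body)
    return path, body.count(marker)
```

Inside parse_page this could be called as `dump_body(response.body)`. Comparing the count with the number of posts visible in a normal browser should show whether the video posts were already missing from the response.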

slakat commented 5 years ago

Yes, that's exactly what happened. Scrapy receives everything but videos; the pages opened in the browser contain only non-video posts. At least now we know what we're trying to fix. Still weird. I'm going to look around for the problem with Scrapy; I can't remember if this behaviour was there from the beginning. If anyone has gotten around this, please share it with us 😺

rugantio commented 5 years ago

Incredibly, changing the UserAgent was enough to get the complete response. Moreover, we don't need to pass it as a header in the Request, because Scrapy already provides an option to set a custom UserAgent for every request. In settings.py you will now find this option set:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

This UserAgent is the one used by tor-browser (to keep it as general as possible). I suggest you pull the repo again, because I also fixed some broken parsers for the date field and the text field. Please confirm that you can retrieve the fields from the videos, so that I can close the issue!
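For anyone who wants to compare responses outside Scrapy, the same header can be attached to a plain standard-library request. This is only a sketch: `build_request` is a hypothetical helper, the URL in the usage note is a placeholder, and fetching real Facebook pages also needs a logged-in session, which is not shown here.

```python
from urllib.request import Request

# Same value as the USER_AGENT option in settings.py.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
)


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Build (but do not send) a request carrying the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})
```

Building two requests for the same page, for example `build_request("https://mbasic.facebook.com/")` and one with a different `user_agent`, and comparing what comes back would be one way to see whether the server tailors the page to the client.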

slakat commented 5 years ago

Ooh, it was a simple fix in the end with the user agent. Smart! I didn't think of changing the custom option to something more general.

I checked and everything works just fine now; it retrieves all the fields, even from video posts. Thank you for all the help!