rugantio / fbcrawl

A Facebook crawler
Apache License 2.0
661 stars, 229 forks

Duplicates and low number of posts for past years #26

Closed: ebergam closed this issue 5 years ago

ebergam commented 5 years ago

Hi @rugantio, if I try to go back in time and get the posts from previous years, I get very few of them. For instance, if I scrape https://www.facebook.com/Repubblica back to 2013, I obtain very few posts for the years before 2018, and many of them are actually duplicates. Do you experience the same behavior?
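A quick way to quantify this is to count unique posts per year and the number of repeated rows in the scraped output. This is only a minimal sketch: the field names `post_id` and `date` (an ISO-style `YYYY-MM-DD` string) are assumptions for illustration and may not match the columns fbcrawl actually writes.

```python
from collections import Counter


def summarize_posts(rows):
    """Count unique posts per year and how many rows are repeats.

    `rows` is an iterable of dicts with the hypothetical keys
    "post_id" and "date" ("YYYY-MM-DD..."); adapt them to the real output.
    """
    seen = set()
    unique_per_year = Counter()
    duplicates = 0
    for row in rows:
        post_id = row["post_id"]
        if post_id in seen:
            # Same post encountered again: count it, but do not re-add it.
            duplicates += 1
            continue
        seen.add(post_id)
        unique_per_year[int(row["date"][:4])] += 1
    return dict(unique_per_year), duplicates


# Toy data standing in for a scraped file (made-up values).
sample_rows = [
    {"post_id": "a", "date": "2019-03-01"},
    {"post_id": "b", "date": "2019-03-02"},
    {"post_id": "a", "date": "2019-03-01"},
    {"post_id": "c", "date": "2014-07-10"},
]
per_year, repeated = summarize_posts(sample_rows)
print(per_year, repeated)
```

With real output, the rows could be loaded with `csv.DictReader` and passed to the same function.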

Thanks,

rugantio commented 5 years ago

Hello @ebergam, for the sake of debugging, I copied my test on Repubblica here: https://pastebin.com/M3fKq8Ee

Have a look at line 686, where the problems begin. The page retrieved by clicking on "more" doesn't have a "more" link itself, so fbcrawl clicks on "2019" (not actually 2018) once, then goes back to looking for the "more" link to follow. This causes some duplicates, which should be filtered by the dupefilter. Again, at line 796 fbcrawl clicks on "2018", and so on, which leaves gaps of some months.

Assuming that Repubblica has a consistent publishing frequency, it also looks like only a handful of posts are retrieved for each day (maybe sampled?), compared to the number of posts retrieved for the most recent days. All of this behavior depends entirely on Facebook's buggy mbasic interface, so I don't think there is much I can do about it.
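To make the navigation pattern easier to follow, here is a toy model of it. It is a sketch under stated assumptions, not fbcrawl's actual code: the in-memory `PAGES` dictionary, the page names, and the `crawl` function are all hypothetical, and the rule "follow "more" if present, otherwise follow the current year link" is only my reading of the log described above.

```python
# Hypothetical stand-in for a few mbasic pages: each has posts and links.
PAGES = {
    "start": {"posts": ["p1", "p2"], "links": {"more": "page2"}},
    # This page lacks a "more" link, so the crawler falls back to a year link.
    "page2": {"posts": ["p3"], "links": {"2019": "y2019"}},
    # The year page shows "p3" again, which is where a duplicate appears.
    "y2019": {"posts": ["p3", "p4"], "links": {"more": "page3"}},
    "page3": {"posts": ["p5"], "links": {}},
}


def crawl(pages, start, year, max_pages=50):
    """Follow "more" when present, else the link for `year`; dedupe posts."""
    seen, unique, duplicates = set(), [], 0
    current = start
    for _ in range(max_pages):  # hard cap so a link cycle cannot loop forever
        page = pages[current]
        for post in page["posts"]:
            if post in seen:
                duplicates += 1  # a dupefilter would drop these
            else:
                seen.add(post)
                unique.append(post)
        links = page["links"]
        if "more" in links:
            current = links["more"]
        elif str(year) in links:
            current = links[str(year)]
            year -= 1  # the next fallback targets the previous year
        else:
            break  # nothing left to follow
    return unique, duplicates


unique_posts, duplicate_count = crawl(PAGES, "start", 2019)
print(unique_posts, duplicate_count)
```

In this toy run the jump to the year page re-serves a post that was already collected, which mirrors the duplicates in the log. The gaps come from the same jump: whatever sits between the last "more" page and the start of the year page is never visited.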

ebergam commented 5 years ago

@rugantio thanks a lot, this answers my questions perfectly. Indeed, it might be a sample of the posts. I guess not much can be done to make scraping past posts more effective, whereas a forward-collection approach still looks like it can yield good results. Thanks again for your very prompt reply and your work!