rugantio / fbcrawl

A Facebook crawler
Apache License 2.0

Profile crawling, specific reaction data crawling, adding error handling #56

Open laols574 opened 4 years ago

laols574 commented 4 years ago

Additionally: in comments.py there is a section of code that begins with "if back". This part checks whether the crawler needs to iterate upwards to get the rest of the replies to a comment. However, some reply pages, like

https://mbasic.facebook.com/comment/replies/?ctoken=10162169751605725_10162170377070725&p=129&count=168&pc=1&ft_ent_identifier=10162169751605725&gfid=AQBjT1xFFeGcZxyW&refid=52&__tn__=R

are the first link visited from the main comment page, because FB displays a comment from the middle of the thread on the main page due to its popularity. Iterating only upwards from such a page skips the replies that come after it. To avoid missing these entries, change:

back = response.xpath('//div[contains(@id,"comment_replies_more_1")]/a/@href').extract()

to

back = response.xpath('//div[contains(@id,"comment_replies_more_2")]/a/@href').extract()

so that the algorithm iterates forwards as well. This means running the crawler once with each XPath, then merging the two separately generated CSV files. This ended up being the easiest solution for me, but it could definitely be done within a single program.
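For the merge step, here is a minimal sketch of what I mean, assuming both runs exported CSV files with the same header row. The function name merge_comment_csvs and the file names in the usage note are hypothetical and not part of fbcrawl. It uses only the standard library and drops rows that are exact duplicates, since the two runs are likely to overlap on the page they both start from.

```python
import csv


def merge_comment_csvs(backward_path, forward_path, merged_path):
    """Concatenate two CSV exports that share a header, skipping exact duplicate rows.

    Hypothetical helper: the paths are whatever the two crawler runs wrote.
    Returns the number of data rows written to merged_path.
    """
    header = None
    seen = set()   # rows already emitted, stored as tuples
    rows = []      # merged rows, in first-seen order

    for path in (backward_path, forward_path):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if file_header is None:
                continue  # empty file, nothing to merge
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"{path} has a different header: {file_header}")
            for row in reader:
                key = tuple(row)
                if key not in seen:
                    seen.add(key)
                    rows.append(row)

    with open(merged_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)

    return len(rows)
```

Usage would look something like merge_comment_csvs("comments_back.csv", "comments_forward.csv", "comments_all.csv"). Deduplicating on the whole row is the simplest choice; if the export has a stable per-comment identifier column, keying on that would be more robust.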