eborbath opened this issue 5 years ago
Hi,
I was wondering, is there a way to feed the list of post urls that the page crawler downloads to the comments scraper? It seems to me that the structure of the comments scraper requires individual links, which would mean re-running the command for every single post. Perhaps making the two scrapers compatible and adding support for a csv list of links to the comments scraper could be a way to avoid that. Or is there an obvious way to do this that I am missing? Thanks!
I have the same question, and your approach seems good to try.
This was a design choice: when I first wrote fbcrawl I wanted CSVs only for certain posts. It's not a bad feature to mix the two crawlers and it's not that difficult to do, but I'm not planning to implement it anytime soon. If you do, please open a PR; I'll be glad to merge new features. I'll keep this open anyway.
I actually implemented this today: you can now crawl comments either from a page (if you pass `-a page`) or from a single post (using `-a post`). I'll be updating the README soon.
The only problem is that when crawling an entire page the reply-to comments are sometimes not in order; I'm afraid that has to do with how scrapy handles concurrent requests. I tried playing a bit with the "priority" parameter that you can pass to scrapy.Request and it has mitigated this a bit (at least the first page of replies is in order).
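Concretely, "playing with priority" means something like this minimal sketch (not the exact fbcrawl code; the helper function is illustrative):

```python
# A minimal sketch of biasing Scrapy's scheduler so that a nested reply-to
# request is dequeued before the already-queued page requests.
import scrapy

def build_reply_request(reply_url, callback):
    # priority defaults to 0; higher values are processed first. With several
    # requests in flight this only biases the order, it does not guarantee
    # that replies come back strictly in sequence.
    return scrapy.Request(reply_url, callback=callback, priority=10)
```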
@rugantio is it possible to scrape the reactions on comments?
@RaghdaZiada It is possible, yes, have a look at the bottom of comments.py. Besides substituting `yield new.load_item()`, you will also have to change `FEED_EXPORT_FIELDS` (at the beginning of comments.py) according to the fields declared in items.py in the class `CommentsItem`.
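For illustration, a hedged sketch of what that change could look like. The field names below are assumptions, not the exact fbcrawl schema; check `CommentsItem` in items.py for the names actually declared there:

```python
# A hedged sketch: exporting reaction fields from the comments spider.
# List here only fields that are actually declared in items.py (CommentsItem).
custom_settings = {
    'FEED_EXPORT_FIELDS': [
        'source', 'reply_to', 'date', 'text',
        'reactions',                                      # total reactions (assumed field name)
        'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr',   # per-reaction counts, if declared
        'url',
    ],
}
```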
There are three reasons why I didn't include the reactions by default:
1) The comments order is messed up. Scrapy's asynchronous framework makes it difficult to correctly crawl the reply-to comments in order when the requests are nested. Although it's claimed that scrapy crawls in DFS order, this is not actually the case. I've tried to enforce DFS using the `priority` parameter in the requests, but it doesn't seem to be enough (see the settings sketch after this list). Maybe the code I wrote is wrong; I would really like someone more experienced to have a look at it, because I'm not actually a programmer, this is just a pet project.
2) With the Facebook blocking in place, crawling the reactions adds one more request per comment, which slows down the crawling process.
3) The total reaction count for a comment is a good indicator of engagement and, in those cases where it's not sufficient, it's actually quite easy to modify the crawler to also get the reactions.
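On point 1, a hedged sketch of standard Scrapy settings (not fbcrawl-specific options) that bring the crawl closer to strict DFS order, at the cost of speed: with a single request in flight, the default LIFO scheduling is no longer shuffled by concurrency.

```python
# Settings sketch for stricter ordering; these would go in the spider's
# custom_settings (or settings.py). The two queue entries are Scrapy's
# defaults, listed only to make the LIFO (depth-first-ish) scheduling explicit.
custom_settings = {
    'CONCURRENT_REQUESTS': 1,
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.LifoMemoryQueue',
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleLifoDiskQueue',
}
```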
@RaghdaZiada Actually, now that I think of it, I can just add a boolean parameter called `reactions` that can be turned on if the comments' reactions have to be crawled. I'll try this over the week and see how it turns out.
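Something like this minimal sketch (class name, selectors and the flag wiring are illustrative assumptions, not the final fbcrawl API), invoked for example with `scrapy crawl comments -a reactions=True ...`:

```python
import scrapy

class CommentsReactionsSketch(scrapy.Spider):
    name = 'comments_reactions_sketch'

    def __init__(self, reactions=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Spider arguments passed with -a arrive as strings.
        self.reactions = str(reactions).lower() in ('1', 'true', 'yes')

    def parse_comment(self, response):
        # ... build and yield the comment item here ...
        if self.reactions:
            # One extra request per comment to collect its reactions.
            # The selector below is a placeholder, not the real mbasic markup.
            reactions_href = response.xpath(
                "//a[contains(@href, 'reaction')]/@href").get()
            if reactions_href:
                yield response.follow(reactions_href, callback=self.parse_reactions)

    def parse_reactions(self, response):
        # ... parse the reaction counts and merge them into the item ...
        pass
```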
@rugantio this would be really great. I'm waiting for your updates; hopefully the boolean parameter trick will work.
Hi! Thanks for taking on implementing this idea! I have tried the new version and it seems to be working well. I just have difficulties pairing the comments with the posts. Is there a way to do this? I thought the url could work, but it is in a different format than the url in the page csv and it is not always the post url. I guess one could get close to merging the two pieces of information based on the date, but very often a page has multiple posts on the same date. Could this be solved somehow? I know it's not ideal, but perhaps grabbing the post text once again could also work?
@rugantio I also want to know if there is a way to extract the data of the page's community, as I haven't found a way to do it so far.
@rugantio @eborbath Last night I started thinking of another alternative: instead of changing the fbcrawl and comments spiders, I tried making some sort of configuration.py file that takes the post urls that fbcrawl.py outputs and feeds them to comments.py in a loop, so all the posts are processed in the classic way these spiders have been working, i.e. by extracting data from an intermediate file or database. My problem here is the time consumed by looping over all the posts and executing the comments spider many times, given that we want to crawl a whole page.
@eborbath yes, I'll fix the URL field sometime during the week. We should actually have two columns: a `post_url` and a `comment_url`.
@RaghdaZiada Yes, you can extract the community info (number of likes and follows) from the mbasic interface at https://mbasic.facebook.com/PAGENAME/community; some other info is available at https://mbasic.facebook.com/PAGENAME/about. Of course this doesn't fit a CSV-style crawler, but these data are publicly available without having to log in, so you might as well just perform a simple GET using `requests` or `urllib` and use `lxml` to parse the correct fields (the starting point being `//div[@id='pages_msite_body_contents']`). A rough sketch of this approach is below.
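A minimal sketch, assuming the community page is publicly visible without login; the mbasic layout may change over time, and PAGENAME is a placeholder:

```python
import requests
from lxml import html

PAGENAME = 'example'  # placeholder page name

resp = requests.get(
    f'https://mbasic.facebook.com/{PAGENAME}/community',
    headers={'User-Agent': 'Mozilla/5.0'},
)
tree = html.fromstring(resp.content)
# Starting point mentioned above; drill down from here to the likes/follows fields.
body = tree.xpath("//div[@id='pages_msite_body_contents']")
if body:
    # Crude extraction: dump the text nodes and pick out the relevant lines.
    text = ' '.join(t.strip() for t in body[0].itertext() if t.strip())
    print(text)
```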
@Medmj Yes, I thought about doing something like this:
1) use fbcrawl on the page
2) use pandas to retrieve the url column and return the full urls in a list
3) write a simple bash script that feeds on the list and launches the comments crawler serially (a rough sketch of this glue step follows below)

For point 3 you can also use scrapyd https://github.com/scrapy/scrapyd which is rather nice! In the end it was just quicker to integrate the two crawlers; too bad that it turned out not to crawl in perfect order. I wonder if someone with more experience with scrapy/twisted could implement a proper DFS, or if it's just not possible using these frameworks.
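A rough sketch of that glue script in Python rather than bash. The 'url' column name and the new `-a post` spider argument are assumptions taken from this thread, and the login arguments (`-a email`, `-a password`) are omitted for brevity:

```python
import subprocess
import pandas as pd

# CSV produced by running fbcrawl on the page.
posts = pd.read_csv('page_output.csv')

for url in posts['url'].dropna().unique():
    # fbcrawl's urls may be relative paths; prepend the mbasic host if so.
    full_url = 'https://mbasic.facebook.com' + url if url.startswith('/') else url
    subprocess.run([
        'scrapy', 'crawl', 'comments',
        '-a', f'post={full_url}',   # assumes the -a post option described above
        '-o', 'comments.csv',       # scrapy appends to an existing output file
    ], check=True)
```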
Thank you for introducing scrapyd. I'm pressed for time and need a quick solution, so I will try to develop this one since it gives more relevant results.
Hello, currently all comments that are on the same page have the same url and no comment_id. Are you planning to implement this, since you mentioned adding "comment_url"? If not, I can try to contribute. Thanks!