scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

Project Status? #409

Open psdon opened 3 years ago

psdon commented 3 years ago

It's been a year since the last commit on the master branch. Do you have any plans to maintain this? I've noticed that a lot of issues don't get resolved, and lots of PRs are still pending.

leopucci commented 3 years ago

Same feeling here. Should I invest my time in using it? The latest code contains bug fixes, but no release has been published with them.

getorca commented 3 years ago

Also wondering the same.

aryaniyaps commented 3 years ago

Any updates on this?

leopucci commented 3 years ago

I think that the lack of updates makes its status clear.

aryaniyaps commented 3 years ago

Thanks for the reply! I am considering moving to some other library or implementing my own solution.

leopucci commented 3 years ago

Try scrapy-cluster... I moved away from Frontera to it.

aryaniyaps commented 3 years ago

I ended up implementing my own distributed crawler based on this paper: https://nlp.stanford.edu/IR-book/pdf/20crawl.pdf

It talks about creating a URL frontier that enqueues and manages URLs. I would just like to give some tips to anyone who looks at this in the future.

When adapting this to Scrapy, the whole concept of "back queues" mentioned in the paper can be discarded: Scrapy already implements it in the downloader (more precisely, in "download slots"), so per-host politeness is taken care of. When scaling out, you might need to write your own downloader that keeps slot state in Redis (the default Scrapy downloader stores slots in memory, which can become inefficient across many crawler nodes).
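
For reference, a minimal illustration of how Scrapy's stock settings map onto that back-queue behavior (the values below are arbitrary examples):

```python
# settings.py -- Scrapy's per-domain download slots play the role of the
# paper's "back queues": one slot per domain, each with its own politeness
# delay and concurrency cap. The values below are arbitrary examples.
CONCURRENT_REQUESTS = 64            # global cap across all slots
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap within each slot (domain)
DOWNLOAD_DELAY = 1.0                # delay between requests in a slot
```

A request can also be pinned to an explicit slot through `request.meta['download_slot']`.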

That said, the one thing we do need to implement is the "front queues". The best place for this is the scheduler.

Say you have N front queues. Push each incoming request into one of the queues according to its priority (if the request has a priority of 3, it is pushed into queue number 3).

When getting the next request, use a weighted random choice to pick one of the front queues, and pop the first request in it. Each front queue must be a FIFO queue, and the weights should be chosen so that high-priority requests flow through more frequently; see the sketch below.
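
A minimal sketch of that enqueue/dequeue logic, assuming redis-py and a reachable Redis instance; the `frontier:front:*` key names, the queue count, and the power-of-two weights are illustrative choices of mine, not frontera's implementation:

```python
import random
from typing import Optional

import redis

r = redis.Redis()  # assumes Redis on localhost:6379

N_QUEUES = 5
# Higher-priority queues get larger weights, so their requests are
# popped more often without completely starving the lower ones.
WEIGHTS = [2 ** i for i in range(N_QUEUES)]

def push_request(url: str, priority: int) -> None:
    """Enqueue a URL into the front queue matching its priority."""
    q = max(0, min(priority, N_QUEUES - 1))  # clamp into valid range
    r.rpush(f"frontier:front:{q}", url)      # RPUSH keeps FIFO order

def next_request() -> Optional[str]:
    """Weighted-random pick of a front queue, then pop its head."""
    start = random.choices(range(N_QUEUES), weights=WEIGHTS, k=1)[0]
    # If the chosen queue is empty, fall back to the remaining ones.
    for q in [start] + [i for i in range(N_QUEUES) if i != start]:
        url = r.lpop(f"frontier:front:{q}")
        if url is not None:
            return url.decode()
    return None  # frontier is empty
```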

The next part is the dupefilter. I store dupefilter keys in Redis and set them to expire after a certain amount of time. If a request is already in the filter, I reject it.
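
A sketch of such a dupefilter, again assuming redis-py; the key prefix and the one-day TTL are illustrative. `SET` with `nx` and `ex` makes the check-and-mark a single atomic round trip:

```python
import hashlib

import redis

r = redis.Redis()

DUPE_TTL = 60 * 60 * 24  # seen-keys expire after one day (tunable)

def is_duplicate(url: str) -> bool:
    """Return True if this URL was already seen within the TTL window."""
    key = "frontier:seen:" + hashlib.sha1(url.encode()).hexdigest()
    # nx=True only writes the key if it does not exist yet, so the
    # return value tells us atomically whether the URL is new.
    added = r.set(key, 1, nx=True, ex=DUPE_TTL)
    return not added  # None (falsy) means the key already existed
```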

This gives a more scalable frontier. I believe this is the concept frontera is built around, but they've implemented it differently.

davidsu-citylitics commented 1 year ago

@aryaniyaps great insight, thanks for sharing!