sergiotapia / magnetissimo

Web application that indexes all popular torrent sites, and saves it to the local database.

MIT License

3k stars, 190 forks

Continuous Scraping #27

Closed. waynerobinson closed this issue 6 years ago

waynerobinson commented 7 years ago

Just curious: does this continuously scrape all the sites, pausing 100ms between pages and then starting again from the top?

I know the comment on https://github.com/sergiotapia/magnetissimo/blob/master/lib/crawler/thepiratebay.ex#L17 says 5 seconds, but it seems to be 1 * 1 * 100 == 100ms.

Seems excessive to do this over and over without a longer break between complete crawls.

sergiotapia commented 7 years ago

It only starts at the top (initial_queue) when the queue itself is empty.

https://github.com/sergiotapia/magnetissimo/blob/master/lib/crawler/thepiratebay.ex#L29

If it has items in the queue, it processes those items first. And that 5 seconds comment is outdated. Shame! 🔔 I need to update it to reflect the real time between processing.
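As a rough illustration of the behaviour described above, here is a minimal sketch in Python (not the project's Elixir code). The names `crawl`, `fetch`, `max_requests`, and `pause` are hypothetical and exist only for this example; the real crawler lives in the linked `thepiratebay.ex`. The sketch assumes that fetching a page may add newly discovered pages to the queue, and that the queue is reseeded from `initial_queue` only when it runs dry.

```python
from collections import deque


def crawl(initial_queue, fetch, max_requests, pause=lambda: None):
    """Process queued pages first; reseed from initial_queue only when empty.

    fetch(url) is a hypothetical callable returning newly discovered URLs.
    pause() stands in for the delay between requests, for example
    time.sleep(0.1) for the 100ms discussed in this thread.
    """
    queue = deque()
    processed = []
    for _ in range(max_requests):
        if not queue:
            # Queue is empty: start again at the top.
            queue.extend(initial_queue)
        if not queue:
            # Nothing to seed from, so there is nothing to crawl.
            break
        url = queue.popleft()
        queue.extend(fetch(url))  # discovered pages go to the back of the queue
        processed.append(url)
        pause()
    return processed
```

With a bounded `max_requests` the loop stops on its own; the behaviour asked about in this issue corresponds to running it indefinitely, so the only break between full crawls is whatever `pause` provides.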

waynerobinson commented 7 years ago

I meant, does it just loop round and round the pages without a larger break in between? The queue contains a list of pages to scrape for torrents, correct?

If this server were to just run in the background, wouldn't it be attempting to download a page every 100ms for each of the sites during the entire time it's operating?

sergiotapia commented 7 years ago

Hey @waynerobinson, circling back to this ticket: I changed the way we're scraping time-wise, and it's much more site-friendly. #72 should land soon.

tchoutri commented 6 years ago

Yup, we are flooding the websites much less now :)