Add nyaa.si crawler

sergiotapia / magnetissimo

Web application that indexes all popular torrent sites, and saves it to the local database.

MIT License

2.99k stars, 187 forks

Add nyaa.si crawler #76

Closed: skwerlman closed this 7 years ago

skwerlman commented 7 years ago

This PR adds a crawler for nyaa.si, a clone of the recently deceased nyaa.se.

Site: https://nyaa.si
Source: https://github.com/nyaadevs/nyaa

The Floki dependency was bumped to version 0.17.2 because the crawler relies on :nth-child(n) support in Floki.find.

ThatLurker commented 7 years ago

Could you also edit the README file and add it to the list at the end?

coline-carle commented 7 years ago

nyaa.si supports RSS if you prefix your search with https://nyaa.si/rss, and parsing XML that contains only metadata may be a better bet than parsing the HTML page.
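To illustrate the RSS suggestion above, here is a minimal sketch in Python (not the project's language, and not code from this PR) of pulling torrent metadata out of a feed instead of scraping HTML. The sample feed and its field names are assumptions modelled on a generic RSS 2.0 document; the real nyaa.si feed may use different or additional elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents, shaped like a generic RSS 2.0 document.
# The real feed's elements are not shown in this thread and may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example torrent A</title>
      <link>https://example.invalid/download/1.torrent</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 -0000</pubDate>
    </item>
    <item>
      <title>Example torrent B</title>
      <link>https://example.invalid/download/2.torrent</link>
      <pubDate>Mon, 01 Jan 2018 00:05:00 -0000</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_items(feed_xml):
    """Return one dict per <item> element found in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

Because each item already carries its metadata as separate elements, no CSS selectors are needed, which is the appeal over parsing the HTML listing.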

skwerlman commented 7 years ago

@pahakalle Done.

@wow-sweetlie The issue with using RSS here is that the feed only lists the 75 most recent torrents, about 1/50th of what this crawler (and most of the other crawlers) currently scrapes.

sergiotapia commented 7 years ago

Hi @skwerlman - thanks for the PR. I'll take a look this weekend and onboard it onto a new branch I've been working on for a while.

All of the importers now scrape only the first pages at most. We're no longer bringing in stuff wholesale because it caused delays in importing. The idea is that when something is uploaded, Magnetissimo should pick it up almost instantly, instead of waiting for the 100+ pages to finish processing and then starting over to pick up new content.

I think RSS fetching is definitely the best approach here.

The PR is here if you want to take a look, @skwerlman: https://github.com/sergiotapia/magnetissimo/pull/72
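The "only scrape the first pages" idea described above can be sketched roughly as follows, in Python rather than the project's own language. This is an assumption about the general shape of such an importer, not the code on the linked branch: the function name, the "id" field, and the in-memory set are all hypothetical stand-ins for whatever the real importers and database use.

```python
def import_new(latest_entries, seen_ids):
    """Return entries not seen before and record their ids as seen.

    latest_entries: entries from the newest page or feed (hypothetical shape,
                    each a dict with an "id" key).
    seen_ids:       a set of ids already imported; mutated in place.
    """
    fresh = []
    for entry in latest_entries:
        if entry["id"] in seen_ids:
            continue  # already imported on an earlier poll
        seen_ids.add(entry["id"])
        fresh.append(entry)
    return fresh


if __name__ == "__main__":
    seen = set()
    first_poll = [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]
    second_poll = [{"id": 2, "title": "B"}, {"id": 3, "title": "C"}]
    print(import_new(first_poll, seen))   # both entries are new
    print(import_new(second_poll, seen))  # only the entry with id 3 is new
```

Each poll touches only the newest listings, so a fresh upload is picked up on the next poll rather than after a full pass over every page.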

skwerlman commented 7 years ago

I am looking at reimplementing this using RSS, based on the new branch, today; I'll open a PR with the new code when it's done.