knightcrawler-stremio / knightcrawler

A self-hosted Stremio addon
Apache License 2.0

Additional crawlers progress tracker #29

Open · Gabisonfire opened this issue 5 months ago

Gabisonfire commented 5 months ago

Re-implement the scrapers from the upstream repo.

purple-emily commented 5 months ago

Looking for suggestions on the easiest way to get started with this. 1337x is a pain. Anyone got any bright ideas?

purple-emily commented 4 months ago

Right, I have an update!

In its current form you'll need some dev experience to get this running, so if you are a casual user please be wary.

Here's a full EZTV scraper. You'll need to run it on a system that can reach both the Knight Crawler Postgres database and an instance of RabbitMQ. This can be the same RabbitMQ that Knight Crawler uses or a separate temporary one.

https://github.com/purple-emily/knight-crawler-scrapers-dirty

It uses Python with Poetry to install the dependencies. If anyone wants a quick guide on how to run it, let me know.

Start one producer and one or two consumers and you should be good.
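
For anyone curious about the shape of it, the producer/consumer split looks roughly like this. This is a minimal sketch assuming the `pika` RabbitMQ client; the queue name, message shape, and `scrape_page` stub are placeholders, and the repo linked above is the real implementation.

```python
# Minimal sketch of the producer/consumer split, assuming the `pika`
# RabbitMQ client. Queue name, message shape, and the scrape_page stub
# are hypothetical; the linked repo is the real implementation.
import json
import pika

QUEUE = "eztv_pages"  # hypothetical queue name

def scrape_page(page: int) -> None:
    # Placeholder for the real work: fetch the EZTV listing page and
    # write the torrents it lists into the Knight Crawler Postgres db.
    print(f"scraping page {page}")

def produce(pages: range) -> None:
    """Publish one message per EZTV listing page to scrape."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue=QUEUE, durable=True)
    for page in pages:
        ch.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=json.dumps({"page": page}),
            properties=pika.BasicProperties(delivery_mode=2),  # persist
        )
    conn.close()

def consume() -> None:
    """Pull pages off the queue, scrape them, and ack on success."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue=QUEUE, durable=True)
    ch.basic_qos(prefetch_count=1)  # one page per consumer at a time

    def on_message(channel, method, properties, body):
        scrape_page(json.loads(body)["page"])
        channel.basic_ack(delivery_tag=method.delivery_tag)

    ch.basic_consume(queue=QUEUE, on_message_callback=on_message)
    ch.start_consuming()
```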

This is generally a single-use script, as Knight Crawler already picks up the most recent releases from EZTV. You can abort and resume at any time and the script should take care of this for you.
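
One way such resumability can work (a guess at the mechanism for illustration, not necessarily what the script actually does) is to persist a small checkpoint between runs:

```python
# A guess at how abort/resume could be implemented, for illustration
# only: persist the last page handled so a restart picks up where the
# previous run stopped. The state file name is hypothetical.
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical state file

def load_last_page() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_page"]
    return 0

def save_last_page(page: int) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_page": page}, f)
```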

This will add at least 200,000 new torrents from initial runs. Final numbers to be confirmed later.

This is essentially an alpha release, so use with caution. Back up Postgres before running.

It should take between one and two hours to fetch the data; there are no confirmed numbers yet for processing it all.

It runs on any system with Python. I have provided a start script for each service: `./start_producer.sh` (or the same command with `.ps1` on Windows), and likewise for `start_a_consumer`.

purple-emily commented 4 months ago

Taking requests for what everyone would like me to prioritise next.

@iPromKnight I don't know if you want to take the logic I have created and convert it to C#. Once we have done a single "full scrape" we don't really have to repeat it. Following the RSS feed gets us all the new releases anyway.

sleeyax commented 4 months ago

> Once we have done a single "full scrape" we don't really have to repeat it.

That's only when the database is shared (or importable), right?

purple-emily commented 4 months ago

> > Once we have done a single "full scrape" we don't really have to repeat it.
>
> That's only when the database is shared (or importable), right?

Essentially, run the scraper once to get all of the history, and then the RSS feed crawler will keep it up to date.
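
To sketch the RSS half of that: a minimal polling loop, assuming the `feedparser` package and an assumed feed URL (the real crawler lives in the main repo and dedups against the database rather than in memory):

```python
# A sketch of the "RSS keeps it up to date" half, assuming the
# `feedparser` package. The feed URL is an assumption; dedup here is
# in-memory only, whereas a real crawler would check the database.
import time
import feedparser

FEED_URL = "https://eztv.re/ezrss.xml"  # assumed EZTV RSS endpoint

def poll(seen: set[str]) -> None:
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        if entry.link in seen:
            continue  # the deep scrape backfills anything older than the feed
        seen.add(entry.link)
        print("new release:", entry.title)

if __name__ == "__main__":
    seen: set[str] = set()
    while True:
        poll(seen)
        time.sleep(15 * 60)  # the feed only ever covers recent releases
```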

sleeyax commented 4 months ago

You already said that in your previous comment, and I get that. What I mean is: where is this scraped history stored? If it's stored in your local database only, then no one else can access it unless they also scrape it themselves.

What I'm trying to get at is this: if KC users are expected to run the EZTV scraper themselves to fetch all of the initial history, I think it would make sense to rewrite your POC in C# for consistency. If the DB is somehow shared, then it doesn't matter as much imo.

Gabisonfire commented 4 months ago

@sleeyax It's stored in the local database. We don't have any sort of database sharing at the moment; it is definitely something I'd like to see happen, but it's going to be a lot of work.

As far as the language is concerned, I don't see a big issue with supporting multiple languages as long as it's pretty much plug and play. All that matters is that the database schema is respected.
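
To make "respect the schema" concrete, a minimal sketch of the contract: whatever language a scraper is written in, it only has to insert rows Knight Crawler can read. The table and column names below are hypothetical placeholders, not the actual Knight Crawler schema; check the repo's migrations for the real definitions.

```python
# Sketch of "respect the schema": a scraper in any language just has to
# insert rows Knight Crawler can read. The table and column names below
# are hypothetical placeholders, NOT the actual Knight Crawler schema.
import psycopg2

def insert_torrent(dsn: str, info_hash: str, title: str, size: int) -> None:
    # `with psycopg2.connect(...)` wraps the statements in a transaction
    # and commits on success.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO torrents (info_hash, title, size)
                VALUES (%s, %s, %s)
                ON CONFLICT (info_hash) DO NOTHING
                """,
                (info_hash, title, size),
            )
```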

iPromKnight commented 4 months ago

The problem with having a shared database is that we then become susceptible to DMCA actions.

When media takedown requests are issued, they are issued against the hash of the magnet as well as the content.

That's why I've been reluctant to implement anything for that, and have relied solely on external sources.

I'm toying with the idea of taking the idea from #45 and expanding on it so that, for a preseed action, it could get the Cinemeta known IMDb id list and just process lookups for them in parallel using the Helios-compatible provider definitions. This would make scraping outside of RSS much more maintainable, as we'd have a generic processing pipeline with scrape actions defined in JSON. It'd also mean that users can easily add their own.
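
As a purely hypothetical illustration of the "scrape actions defined in JSON" idea: a generic pipeline reads a provider definition and runs it against an IMDb id. Every field name here is invented for the example and is not the actual Helios provider definition format.

```python
# Hypothetical illustration of "scrape actions defined in json": a
# generic pipeline reads a provider definition and runs it against an
# IMDb id. Every field name is invented for this example -- it is not
# the actual Helios provider definition format.
import json

PROVIDER_JSON = """
{
  "name": "example-provider",
  "search_url": "https://example.org/search?imdb={imdb_id}",
  "result_selector": "a.magnet"
}
"""

def build_lookup_url(definition: dict, imdb_id: str) -> str:
    # A real pipeline would fetch this URL and apply result_selector;
    # here we only show how the JSON drives the lookup.
    return definition["search_url"].format(imdb_id=imdb_id)

provider = json.loads(PROVIDER_JSON)
print(build_lookup_url(provider, "tt0133093"))
```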

iPromKnight commented 4 months ago

One of the reasons I wanted to redo the consumer in TypeScript and didn't rewrite it in C# was to show that, with a service bus in place, it doesn't matter what tech we write services in šŸ˜ƒ

purple-emily commented 4 months ago

> One of the reasons I wanted to redo the consumer in TypeScript and didn't rewrite it in C# was to show that, with a service bus in place, it doesn't matter what tech we write services in šŸ˜ƒ

I'm going to do a refactor of the "deep eztv" crawler I've written and then try to use it as a framework to make more. nyaa.si has an RSS feed. How easy would it be to add it to the C# scraper?

iPromKnight commented 4 months ago

If it's RSS, it's really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.
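
The scraper itself is C#, but the derive-and-override pattern looks roughly like this (sketched in Python purely for illustration; class and method names are made up, not the real API):

```python
# The Knight Crawler scraper is C#; Python is used here purely to
# sketch the derive-and-override pattern. Names are illustrative.
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET

class XmlScraper(ABC):
    """Shared plumbing: parse a feed and hand each item to the subclass."""

    def scrape(self, xml_text: str) -> list[dict]:
        root = ET.fromstring(xml_text)
        return [self.parse_item(item) for item in root.iter("item")]

    @abstractmethod
    def parse_item(self, item: ET.Element) -> dict:
        """Each tracker overrides only the per-item mapping."""

class NyaaScraper(XmlScraper):
    def parse_item(self, item: ET.Element) -> dict:
        return {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
```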

purple-emily commented 4 months ago

> If it's RSS, it's really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.

Do you think that's something you could do? Or does anyone else want to offer to do it? I can make that the next deep scraper, as it's our most requested in Discord.

Gabisonfire commented 4 months ago

@purple-emily I can take care of it. I was going to do Torrent9, but I can probably do both.

purple-emily commented 4 months ago

@iPromKnight are you not able to make a throwaway account and join Discord, even if it's just to stick it on mute and never speak in the group context, so me or Gabi can keep in contact?

purple-emily commented 4 months ago

As per #98, we now support new releases from nyaa.si.

Support for scraping old releases is to come.

dmitrc commented 1 month ago

> If it's RSS, it's really easy, as we have an abstract XML scraper. You just have to derive from that and override the required methods.

Did we add that abstract XML scraper by any chance? That could be useful for adding niche trackers with rich catalogs, like RuTracker etc.