knightcrawler-stremio / knightcrawler

A self-hosted Stremio addon
Apache License 2.0

DMM scraping may be broken after 999 hashlists #95

Closed: funkypenguin closed this issue 4 months ago

funkypenguin commented 4 months ago

Describe the bug

A user recently pointed out to me that newly added DMM content is no longer appearing at https://knightcrawler.elfhosted.com. I checked my producer logs and saw this error:

10:51:25 [Information] [Producer.Crawlers.Sites.TpbCrawler] Starting "TPB" crawl
10:51:25 [Information] [Producer.Crawlers.Sites.YtsCrawler] Starting "YTS" crawl
10:51:25 [Information] [Producer.Crawlers.Sites.EzTvCrawler] Ingestion Successful - Wrote 0 new torrents
10:51:25 [Information] [Producer.Crawlers.Sites.YtsCrawler] Ingestion Successful - Wrote 0 new torrents
10:51:26 [Information] [Producer.Crawlers.Sites.TpbCrawler] Ingestion Successful - Wrote 0 new torrents
10:52:05 [Error] [Quartz.Core.JobRunShell] Job Crawlers.TgxCrawler threw a JobExecutionException:
Parameters: refire = True, unscheduleFiringTrigger = False, unscheduleAllTriggers = False
 Quartz.JobExecutionException
 ---> System.Threading.Tasks.TaskCanceledException: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
 ---> System.TimeoutException: A task was canceled.
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.<SendCoreAsync>g__Core|5_0(HttpRequestMessage request, Boolean useAsync, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.<SendCoreAsync>g__Core|5_0(HttpRequestMessage request, Boolean useAsync, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.GetStringAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
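
(Aside: the 100 seconds in that trace is just HttpClient's default Timeout. If TGX is merely slow rather than down, the limit could presumably be raised wherever the producer registers its named clients; a minimal sketch, assuming registration goes through IHttpClientFactory as the DMM crawler's constructor suggests, with a made-up client name "tgx":)

// Sketch only: raising the default 100-second HttpClient.Timeout for a slow source.
// "tgx" is a hypothetical client name; the producer's real registration may differ.
using System;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

services.AddHttpClient("tgx", client =>
{
    // HttpClient.Timeout defaults to 100 seconds; give slow trackers longer.
    client.Timeout = TimeSpan.FromMinutes(5);
});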

I restarted the producer, and while the error is gone (IDK if it was even related), I still see no fresh DMM scrapes:

11:06:05 [Information] [Quartz.ContainerConfigurationProcessor] Adding job: Crawlers.TpbCrawler
11:06:05 [Information] [Quartz.ContainerConfigurationProcessor] Adding job: Crawlers.YtsCrawler
11:06:05 [Information] [Quartz.ContainerConfigurationProcessor] Adding job: Crawlers.TgxCrawler
11:06:05 [Information] [Quartz.ContainerConfigurationProcessor] Adding job: Jobs.IPJob
11:06:05 [Information] [Quartz.ContainerConfigurationProcessor] Adding job: Jobs.PublisherJob
11:06:05 [Information] [Quartz.ContainerConfigurationProcessor] Adding job: Crawlers.DebridMediaManagerCrawler
11:06:05 [Information] [MassTransit] Bus started: rabbitmq://10.43.0.2/
11:06:24 [Information] [Producer.Crawlers.Sites.EzTvCrawler] Starting "EZTV" crawl
11:06:24 [Information] [Producer.Crawlers.Sites.TpbCrawler] Starting "TPB" crawl
11:06:24 [Information] [Producer.Crawlers.Sites.YtsCrawler] Starting "YTS" crawl
11:06:24 [Information] [Producer.Crawlers.Sites.TgxCrawler] Starting "TorrentGalaxy" crawl
11:06:25 [Information] [Producer.Services.IpService] Public IP Address: "94.130.70.123"
11:06:25 [Information] [Producer.Crawlers.Sites.EzTvCrawler] Ingestion Successful - Wrote 0 new torrents
11:06:26 [Information] [Producer.Crawlers.Sites.YtsCrawler] Ingestion Successful - Wrote 0 new torrents
11:06:27 [Information] [Producer.Crawlers.Sites.TgxCrawler] Ingestion Successful - Wrote 0 new torrents

Note that Crawlers.DebridMediaManagerCrawler is registered at startup but, unlike the other crawlers, never logs a "Starting" crawl or an "Ingestion Successful" line. At first glance, the issue may be that the DMM crawler does this:

public partial class DebridMediaManagerCrawler(
    IHttpClientFactory httpClientFactory,
    ILogger<DebridMediaManagerCrawler> logger,
    IDataStorage storage,
    GithubConfiguration githubConfiguration) : BaseCrawler(logger, storage)
{
    // Extracts the hashlist payload from the <iframe> embed in each hashlist page
    [GeneratedRegex("""<iframe src="https:\/\/debridmediamanager.com\/hashlist#(.*)"></iframe>""")]
    private static partial Regex HashCollectionMatcher();

    // Pulls a season number out of names like "S01" or "Season 1"
    [GeneratedRegex(@"[sS]([0-9]{1,2})|seasons?[\s-]?([0-9]{1,2})", RegexOptions.IgnoreCase, "en-GB")]
    private static partial Regex SeasonMatcher();

    protected override IReadOnlyDictionary<string, string> Mappings => new Dictionary<string, string>();
    // Lists the hashlists repo via GitHub's contents API
    protected override string Url => "https://api.github.com/repos/debridmediamanager/hashlists/contents";
    protected override string Source => "DMM";

    // ... (rest of the class elided)

But https://api.github.com/repos/debridmediamanager/hashlists/contents only returns 999 results... GitHub's contents API caps directory listings at 1,000 files, so any hashlist beyond that cap is simply invisible to the crawler.
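
To sanity-check this without touching the producer, something like the following standalone snippet should show the cap (rough sketch only: it's unauthenticated, so rate-limited, and GitHub rejects requests that lack a User-Agent header):

// Count the entries the contents API actually returns.
using System;
using System.Net.Http;
using System.Text.Json;

using var http = new HttpClient();
// GitHub's API returns 403 for requests with no User-Agent.
http.DefaultRequestHeaders.UserAgent.ParseAdd("dmm-truncation-check");

var json = await http.GetStringAsync(
    "https://api.github.com/repos/debridmediamanager/hashlists/contents");

using var doc = JsonDocument.Parse(json);

// The contents API silently caps directory listings at 1,000 files,
// so this prints ~1,000 no matter how many hashlists actually exist.
Console.WriteLine($"Entries returned: {doc.RootElement.GetArrayLength()}");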

funkypenguin commented 4 months ago

Riffing on my theory here, the 1,000 entries that do come back appear to be alphanumerically sorted:

[screenshot: API response entries, sorted alphanumerically by filename]

So, that being the case, we've probably been scraping less and less fresh content over time: a new hashlist only shows up if its name happens to sort into the first 1,000 entries.
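
If that theory holds, one possible fix (an untested sketch, not how the crawler currently works) is to enumerate the repo through the Git Trees API instead: a recursive tree fetch returns up to 100,000 entries per response and sets a truncated flag when even that limit is hit. The branch name main below is an assumption:

// List every file in the hashlists repo via the Git Trees API,
// sidestepping the contents API's 1,000-entry cap.
using System;
using System.Net.Http;
using System.Text.Json;

using var http = new HttpClient();
http.DefaultRequestHeaders.UserAgent.ParseAdd("dmm-hashlist-lister");

var json = await http.GetStringAsync(
    "https://api.github.com/repos/debridmediamanager/hashlists/git/trees/main?recursive=1");

using var doc = JsonDocument.Parse(json);

// "truncated": true would mean even the trees API's own limit was hit.
Console.WriteLine($"Truncated: {doc.RootElement.GetProperty("truncated").GetBoolean()}");

foreach (var entry in doc.RootElement.GetProperty("tree").EnumerateArray())
{
    if (entry.GetProperty("type").GetString() == "blob")
    {
        Console.WriteLine(entry.GetProperty("path").GetString());
    }
}

That would let the crawler diff the full set of hashlist names against what it has already ingested, instead of only ever seeing the first ~1,000 alphabetically.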