User-agent appears to be blocked by thepiratebay (and other issues with pirate bay)

csm10495 commented 1 year ago

Describe the bug When searching via thepiratebay, I get back a 403. It's related to the user-agent string.

To Reproduce Try to search for something via thepiratebay

Expected behavior I'd be able to use thepiratebay as a provider.

Screenshots N/A

Medusa (please complete the following information):

OS: Win10x64
Branch: master
Commit: 6b37ad0ae6dfc3c2b102e2d9a377ced1fc849476
Python version: 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]
Database: 44.19

Debug logs (at least 50 lines):

2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] No data returned from provider
2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] The response returned a non-200 response while requesting url https://thepiratebay.org/search/AEW All Access S01E02/0/3/200 Error: HTTPError('403 Client Error: Forbidden for url: https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200')
2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] User-Agent: Medusa/1.0.13 (Windows; 10; 4ed4e6ac-f6a3-11ed-ac91-0800271dbc99)
2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] GET URL: https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200 [Status: 403]

Additional context I used requests to check the theory:

In [7]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'LOL'})
Out[7]: <Response [200]>

In [8]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa/1.0.13 (Windows; 10; 4ed4e6ac-f6a3-11ed-ac91-080027'})
Out[8]: <Response [403]>

In [9]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'LOL'})
Out[9]: <Response [200]>

Maybe a user-settable user-agent would be helpful here?
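To make the idea concrete, here is a rough sketch of what an overridable user-agent could look like. This is not Medusa's actual code: `build_headers`, `custom_user_agent`, and `DEFAULT_APP` are hypothetical names, and the default string only approximates the one in the log above.

```python
import platform

# Hypothetical default; the version number is copied from the debug log above.
DEFAULT_APP = "Medusa/1.0.13"


def build_headers(custom_user_agent=None):
    """Return request headers, preferring a user-supplied User-Agent.

    `custom_user_agent` stands in for a value read from the user's settings.
    An empty or whitespace-only value falls back to the default string.
    """
    if custom_user_agent and custom_user_agent.strip():
        user_agent = custom_user_agent.strip()
    else:
        user_agent = f"{DEFAULT_APP} ({platform.system()}; {platform.release()})"
    return {"User-Agent": user_agent}
```

The resulting dict could then be passed along the same way as in the tests above, e.g. `requests.get(url, headers=build_headers('LOL'))`.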

csm10495 commented 1 year ago

More examples of user agent testing:

In [13]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa/1.0.12'})
Out[13]: <Response [403]>

In [14]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa2/1.0.12'})
Out[14]: <Response [200]>

In [15]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa/1.0.12'})
Out[15]: <Response [403]>
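
For anyone who wants to repeat this with more strings, a small helper along these lines might save some typing. It is only a sketch: `probe_user_agents` and the `fetch` parameter are names I made up, and `fetch` is passed in so the helper itself does not depend on any particular HTTP library.

```python
def probe_user_agents(url, user_agents, fetch):
    """Map each User-Agent string to the status code that `fetch` reports.

    `fetch` is any callable taking (url, headers) and returning an integer
    status code; it is a hypothetical hook, not part of requests or Medusa.
    """
    return {ua: fetch(url, {"User-Agent": ua}) for ua in user_agents}
```

With requests, `fetch` could be `lambda url, headers: requests.get(url, headers=headers, timeout=10).status_code`. The results so far are consistent with a block on agents that start with `Medusa/`, but that is a guess from three samples.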

csm10495 commented 1 year ago

Gee whiz, so even after that, it sort of looks like the parsing for thepiratebay is off. It looks like the page loads its results using JavaScript, while the parser seems to assume static content that it can easily parse. Anyone else seeing this?

csm10495 commented 1 year ago

Using https://thepiratebay7.com as an alternative URL seems to be working.

csm10495 commented 1 year ago

It looks like under the hood there is an API that can be called instead of parsing the HTML:

https://apibay.org/q.php?q=

Like:

https://apibay.org/q.php?q=The+price+is+right

Jackett seems to be using it already: https://github.com/Jackett/Jackett/pull/9593/files
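
A rough sketch of how a provider could use it instead of scraping is below. Only the endpoint and the `q` parameter come from the URLs above. The JSON field names (`name`, `info_hash`, `seeders`, `leechers`) are assumptions about the response shape and should be checked against a real response, and `build_search_url` / `parse_results` are hypothetical helpers, not Medusa or Jackett code.

```python
import json
from urllib.parse import quote, urlencode

API_URL = "https://apibay.org/q.php"  # endpoint quoted above


def build_search_url(query):
    """Build the search URL; urlencode turns spaces into '+' as in the example."""
    return f"{API_URL}?{urlencode({'q': query})}"


def parse_results(payload):
    """Convert the JSON text of a response into simple result dicts.

    Assumes the response is a JSON list of objects with 'name', 'info_hash',
    'seeders' and 'leechers' keys (unverified assumption).
    """
    results = []
    for item in json.loads(payload):
        info_hash = item.get("info_hash", "")
        if not info_hash:
            continue  # nothing to build a magnet link from
        name = item.get("name", "")
        results.append({
            "title": name,
            "seeders": int(item.get("seeders", 0)),
            "leechers": int(item.get("leechers", 0)),
            # Standard magnet URI form: BitTorrent info hash plus display name.
            "magnet": f"magnet:?xt=urn:btih:{info_hash}&dn={quote(name)}",
        })
    return results
```

The response body would come from something like `requests.get(build_search_url('The price is right')).text`, which is then handed to `parse_results`.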

BenjV commented 1 year ago

I don't have any problem with the Medusa user agent on thepiratebay. Maybe your provider is blocking it?

Anyway, it would be a good idea if the Medusa developers changed from screen scraping to using this new API.

csm10495 commented 1 year ago

I don't think it's my provider, since the block depends on the user agent, and I think user agents aren't sniffable over HTTPS (the request headers are encrypted in transit).

pymedusa / Medusa

User-agent appears to be blocked by thepiratebay (and other issues with pirate bay) #11226