pymedusa / Medusa

Automatic Video Library Manager for TV Shows. It watches for new episodes of your favorite shows, and when they are posted it does its magic.
https://pymedusa.com
GNU General Public License v3.0

User-agent appears to be blocked by thepiratebay (and other issues with pirate bay) #11226

Open csm10495 opened 1 year ago

csm10495 commented 1 year ago

Describe the bug When searching via thepiratebay, I get back a 403. It's related to the User-Agent string.

To Reproduce Try to search for something via thepiratebay

Expected behavior I'd be able to use thepiratebay as a provider.

Screenshots N/A

Medusa (please complete the following information):

Debug logs (at least 50 lines):

2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] No data returned from provider
2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] The response returned a non-200 response while requesting url https://thepiratebay.org/search/AEW All Access S01E02/0/3/200 Error: HTTPError('403 Client Error: Forbidden for url: https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200')
2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] User-Agent: Medusa/1.0.13 (Windows; 10; 4ed4e6ac-f6a3-11ed-ac91-0800271dbc99)
2023-05-19 17:30:33 DEBUG   FORCEDSEARCHQUEUE-MANUAL-431454 :: ThePirateBay :: [6b37ad0] GET URL: https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200 [Status: 403]

Additional context I used requests to check the theory:

In [7]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'LOL'})
Out[7]: <Response [200]>

In [8]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa/1.0.13 (Windows; 10; 4ed4e6ac-f6a3-11ed-ac91-080027'})
Out[8]: <Response [403]>

In [9]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'LOL'})
Out[9]: <Response [200]>

Maybe a user-settable User-Agent would be helpful here?
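
A minimal sketch of what a user-settable User-Agent could look like. This is not Medusa's actual code: `DEFAULT_USER_AGENT`, `effective_user_agent` and `make_opener` are hypothetical names, and the sketch uses the standard library's `urllib` rather than requests so it runs without extra dependencies.

```python
import urllib.request

# Hypothetical default, shortened from the User-Agent shown in the debug log.
DEFAULT_USER_AGENT = "Medusa/1.0.13"


def effective_user_agent(override=None):
    """Return the user's override if it is non-empty, else the default."""
    override = (override or "").strip()
    return override or DEFAULT_USER_AGENT


def make_opener(override=None):
    """Build a urllib opener that sends the chosen User-Agent on every request."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to each request.
    opener.addheaders = [("User-Agent", effective_user_agent(override))]
    return opener
```

With something like this, `make_opener("LOL").open(url)` would send `LOL` as the User-Agent, while `make_opener()` would keep the default.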

csm10495 commented 1 year ago

More examples of user agent testing:

In [13]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa/1.0.12'})
Out[13]: <Response [403]>

In [14]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa2/1.0.12'})
Out[14]: <Response [200]>

In [15]: requests.get('https://thepiratebay.org/search/AEW%20All%20Access%20S01E02/0/3/200', headers={'User-Agent': 'Medusa/1.0.12'})
Out[15]: <Response [403]>
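
The checks above can be wrapped in a small helper to try several user agents in one go. This is only a sketch: `build_request`, `status_for` and `probe` are names made up for illustration, and it uses the standard library's `urllib` instead of requests so it has no dependencies.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def probe(url, user_agents, fetch=status_for):
    """Map each User-Agent to the status code returned for it."""
    return {ua: fetch(url, ua) for ua in user_agents}
```

Calling `probe(url, ["Medusa/1.0.12", "Medusa2/1.0.12", "LOL"])` with the search URL above should show which strings get a 403, assuming the site still behaves the same way.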

csm10495 commented 1 year ago

Gee whiz, even after that, it looks like the parsing for thepiratebay is off. The page seems to load its results using JavaScript, while the parser seems to assume static HTML that it can parse directly. Anyone else seeing this?

csm10495 commented 1 year ago

Using https://thepiratebay7.com as an alternative URL seems to work.

csm10495 commented 1 year ago

It looks like under the hood there is an API that can be called instead of parsing the HTML:

https://apibay.org/q.php?q=

Like:

https://apibay.org/q.php?q=The+price+is+right

Jackett seems to be using it already: https://github.com/Jackett/Jackett/pull/9593/files
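
A rough sketch of calling that API instead of scraping. Only the `q.php?q=` URL comes from what I saw above. The JSON field names (`name`, `info_hash`, `seeders`, `leechers`) and the all-zero `info_hash` placeholder for "no results" are assumptions about the response shape, and the function names are made up. It uses the standard library's `urllib` rather than requests.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://apibay.org/q.php"


def build_search_url(query):
    """Build the search URL; urlencode turns spaces into '+'."""
    return API_URL + "?" + urllib.parse.urlencode({"q": query})


def parse_results(payload):
    """Turn the JSON text into a list of dicts (field names are assumed)."""
    results = []
    for item in json.loads(payload):
        info_hash = item.get("info_hash", "")
        # Assumption: an empty search is signalled by an all-zero hash.
        if not info_hash or set(info_hash) == {"0"}:
            continue
        results.append({
            "title": item.get("name", ""),
            "info_hash": info_hash,
            "seeders": int(item.get("seeders", 0)),
            "leechers": int(item.get("leechers", 0)),
        })
    return results


def search(query, timeout=10):
    """Fetch and parse search results for the query."""
    with urllib.request.urlopen(build_search_url(query), timeout=timeout) as resp:
        return parse_results(resp.read().decode("utf-8"))
```

`build_search_url("The price is right")` gives the same URL as the example above, so `search("The price is right")` would hit that endpoint and return structured results without any HTML parsing.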

BenjV commented 1 year ago

I don't have any problem with the Medusa user agent on thepiratebay. Maybe your provider is blocking it?

Anyway, it would be a good idea if the Medusa developers changed from screen scraping to using this new API.

csm10495 commented 1 year ago

I don't think it's my provider, since the block depends directly on the user agent, and I don't think user agents are sniffable over HTTPS.