Problem recognising series with some old URLs in tvshow.nfo files #16

thetvdb / metadata.tvshows.thetvdb.com.v4.python

TheTVDB Official Kodi TV plugin

Closed jhenstridge closed 1 year ago

jhenstridge commented 1 year ago

I was testing out the plugin with my library, and found that it failed to scan some folders with tvshow.nfo files in them. The affected tvshow.nfo files contained old thetvdb.com URLs of the form:

http://thetvdb.com/?tab=series&id=204781&lid=7

These seemed to work with the old scraper maintained by the Kodi team. Having a look through the source code, the URI seems to match this regexp:

https://github.com/thetvdb/metadata.tvshows.thetvdb.com.v4.python/blob/10cb449edbc76d3e85b383b2dea5cbf8302600b8/metadata.tvshows.thetvdb.com.v4.python/resources/lib/nfo.py#L20

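... but it extracts the lid value rather than id, since the text id= also occurs at the end of lid= and the pattern ends up taking that last occurrence. Editing the file and removing the &lid=7 bit at the end of the URL allows the scraper to recognise the series.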

I think changing the regexp to something like the following would work:

 r'(thetvdb)\.com[\w=&\?/]*[&\?]id=(\d+)',

That is, requiring a ? or & immediately before the id parameter.
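As a rough check of that suggestion, here is a minimal sketch comparing the two patterns on the URL above. The CURRENT pattern is the one quoted in the interpreter session further down this thread; extract_id is a hypothetical helper written for this comparison, not a function from the plugin.

```python
import re

# Pattern the plugin appears to use (as quoted later in this thread).
CURRENT = r'(thetvdb)\.com[\w=&\?/]+id=(\d+)'
# Proposed pattern: "id=" must be preceded directly by '?' or '&'.
PROPOSED = r'(thetvdb)\.com[\w=&\?/]*[&\?]id=(\d+)'


def extract_id(pattern, url):
    """Hypothetical helper: return the captured ID as a string, or None."""
    match = re.search(pattern, url)
    return match.group(2) if match else None


url = 'http://thetvdb.com/?tab=series&id=204781&lid=7'
print(extract_id(CURRENT, url))   # 7
print(extract_id(PROPOSED, url))  # 204781
```

The proposed pattern also accepts a URL where id is the first query parameter (for example ?id=204781), because the bracketed class allows either ? or & in front of it.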

KarellenX commented 1 year ago

Hello @jhenstridge

It's a known issue... https://github.com/xbmc/xbmc/issues/19845

With a fix merged a couple of weeks ago... https://github.com/xbmc/metadata.tvshows.themoviedb.org.python/pull/88

Is it the same fix as yours?

jhenstridge commented 1 year ago

My tvshow.nfo file consisted of a plain URL rather than being XML, so I don't think that bug is related. And removing the lid parameter seemed to fix the problem.

Here is an example of the regexp matching the URL I mentioned:

>>> import re
>>> url = 'http://thetvdb.com/?tab=series&id=204781&lid=7'
>>> match = re.search(r'(thetvdb)\.com[\w=&\?/]+id=(\d+)', url)
>>> match.groups()
('thetvdb', '7')

There's no series with id=7, so it fails to match the series.
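For comparison, a regex-free cross-check: parsing the query string with the standard library shows id and lid as separate parameters. This is only an illustrative sketch, not how the plugin works, and query_params is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def query_params(url):
    """Hypothetical helper: return the URL's query parameters as a dict of lists."""
    return parse_qs(urlparse(url).query)


params = query_params('http://thetvdb.com/?tab=series&id=204781&lid=7')
print(params['id'])   # ['204781']
print(params['lid'])  # ['7']
```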

jhenstridge commented 1 year ago

Looking closer at the themoviedb scraper, it appears to use a different regexp than this plugin to extract the ID from this style of URL:

https://github.com/xbmc/metadata.tvshows.themoviedb.org.python/blob/matrix/libs/data_utils.py#L52

It seems to handle URLs with the lid parameter correctly:

>>> re.search(r'(thetvdb)\.com.+&id=(\d+)', url).groups()
('thetvdb', '204781')
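A small self-contained sketch of why that pattern copes with the lid parameter: it requires a literal & directly before id=, so &lid= cannot match. The snippet only exercises this one pattern in isolation (the scraper may well try other patterns too), and find_id is a hypothetical helper name.

```python
import re

# Pattern quoted above from the themoviedb scraper.
TMDB_STYLE = r'(thetvdb)\.com.+&id=(\d+)'


def find_id(url):
    """Hypothetical helper: return the captured ID as a string, or None."""
    match = re.search(TMDB_STYLE, url)
    return match.group(2) if match else None


print(find_id('http://thetvdb.com/?tab=series&id=204781&lid=7'))  # 204781
# With id as the first parameter there is no '&' before it, so this
# pattern on its own does not match.
print(find_id('http://thetvdb.com/?id=204781'))  # None
```

One difference from the regexp I suggested earlier is that this one does not accept ? before id=, which only matters for URLs where id is the first query parameter.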

KarellenX commented 1 year ago

Oh, sorry. You were reporting an NFO parsing issue. I must have been half asleep and just jumped to a similar known issue.

@pkscout will check it out.

pkscout commented 1 year ago

This is the TVDB scraper, right? That's not the one I'm maintaining. I know we just updated the scrapers that use The Movie Database with some different regex parsing to handle old TVDB URLs better, and I think the comments here indicate that the TVDB team need to update their scraper as well to deal with older URL formats from their site.

jhenstridge commented 1 year ago

Yep. I am using the "The TVDB v4" scraper. This looked like the right place to file the bug report, and the code here seems to match the behaviour I observed.

KarellenX commented 1 year ago

TVDB TV Shows... https://github.com/thetvdb/metadata.tvshows.thetvdb.com.v4.python/issues

TVDB Movies... https://github.com/thetvdb/metadata.movies.thetvdb.com.v4.python

antheaezzell commented 1 year ago

This has been internally ticketed for review - https://mediamorph.atlassian.net/browse/TVD-3391