guessit-io / guessit

GuessIt is a python library that extracts as much information as possible from a video filename.
https://guessit-io.github.io/guessit
GNU Lesser General Public License v3.0

Wrong guess of movies when the resolution/screen size does not end with 'p' #693

Open nachocho opened 3 years ago

nachocho commented 3 years ago

Hi, I'm not sure this qualifies as a bug, but I think it does. Subscene has several subtitles whose release info (which is supposed to come from actual movie releases) produces a wrong guess.

For example (note the resolution does not end with 'p'):

guessit('Gladiator.EXTENDED.2000.720.BrRip.264.YIFY')
guessit('Gladiator.EXTENDED.2000.1080.BrRip.264.YIFY')

produces MatchesDict([('title', 'Gladiator'), ('edition', 'Extended'), ('year', 2000), ('season', 7), ('episode', 20), ('source', 'Blu-ray'), ('other', ['Reencoded', 'Rip']), ('release_group', '264.YIFY'), ('type', 'episode')])

and

MatchesDict([('title', 'Gladiator'), ('edition', 'Extended'), ('year', 2000), ('season', 10), ('episode', 80), ('source', 'Blu-ray'), ('other', ['Reencoded', 'Rip']), ('release_group', '264.YIFY'), ('type', 'episode')])

respectively. This is wrong: for some reason guessit interprets 720 as season 7, episode 20, and 1080 as season 10, episode 80. Those numbers contain no separator, which is why I think the guess is wrong; an assumption like that may break other release names as well. It also makes the guessed type wrong: these are movies, not episodes.

Another, tougher to guess, example is:

guessit('Gladiator 23.976 FPS')
guessit('Gladiator 25.000 FPS')

which produces results:

MatchesDict([('title', 'Gladiator'), ('episode', [23, 76]), ('season', 9), ('episode_title', 'FPS'), ('type', 'episode')])

and

MatchesDict([('title', 'Gladiator'), ('episode', [25, 0]), ('season', 0), ('episode_title', 'FPS'), ('type', 'episode')])

respectively. I know these are tough ones, maybe even invalid titles for guessit, but again, the way it assumes seasons and episodes looks odd. How does 23.976 translate into season 9 with episodes 23 and 76, and 25.000 into season 0 with episodes 25 and 0? As a result, these are also guessed as episodes rather than movies.

Here is another example:

guessit('Aliens DVD Silver Box Set 131 Min')

produces

MatchesDict([('title', 'Aliens'), ('source', 'DVD'), ('season', 1), ('episode', 31), ('episode_title', 'Min'), ('type', 'episode')])

Again, this is guessed as an episode instead of a movie, and the number 131 (with no separator whatsoever) is treated as a season and episode number.

I hope these examples help improve the product (which is great!) if the bug report is accepted.

Thanks

Toilal commented 3 years ago

> for some reason guessit is interpreting the 720 as season 7 episode 20 and the 1080 as season 10 and episode 80

This is a common pattern for episode numbering in the anime scene, which is why it's guessed as season/episode. I'm not sure I want to fix this one. The same goes for the 131 case: technically and statistically speaking, it's more likely to be an episode than a movie.
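To illustrate the reading described above, here is a minimal sketch of how a bare three- or four-digit number can be split into a season and an episode. It is not guessit's actual implementation; the function name `split_season_episode` is hypothetical, and the rule (last two digits are the episode, leading digits are the season) is an assumption that happens to reproduce the results reported in this issue.

```python
def split_season_episode(token):
    """Split a bare 3- or 4-digit token into (season, episode).

    Hypothetical helper, not part of guessit: the last two digits are
    read as the episode and the leading digits as the season.
    Returns None for anything that is not a 3- or 4-digit number.
    """
    if not token.isdigit() or len(token) not in (3, 4):
        return None
    return int(token[:-2]), int(token[-2:])


for token in ("720", "1080", "131"):
    print(token, split_season_episode(token))
```

Under this rule a resolution such as 720 or a duration such as 131 is indistinguishable from a compact episode number, which is why context (a year, a unit such as "Min") is needed to tell them apart.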

The FPS case is a different problem and could be fixed with a new property. What about frame_rate?

ratoaq2 commented 3 years ago

frame_rate is a good choice.

When I implemented https://github.com/ratoaq2/knowit I tried to have the names consistent with guessit and I used frame_rate for that

nachocho commented 3 years ago

> for some reason guessit is interpreting the 720 as season 7 episode 20 and the 1080 as season 10 and episode 80
>
> This is a common pattern for some episode numbering in anime scene, that's why it's guessed as season/episode. I'm not sure I want to fix this one. Same for 131 case, in fact, technically and statistically speaking, it's more likely to be an episode than a movie.
>
> For the FPS thing, it's another problem and could be fixed with a new property, what about frame_rate?

Well, I would think it is more common to have movies with release information of the form I gave (with a bare resolution, as in 'Gladiator.EXTENDED.2000.720.BrRip.264.YIFY') than anime episode titles that use a single number. Honestly, merging season and episode into a single number looks plain wrong. I understand the intention is to support them, but IMO this type of movie release, which is VERY common, should ideally be supported too.

It would be nice to have support for the FPS, but correctly guessing movies with a bare resolution is more important IMO, because that case is common.

Thanks.

Toilal commented 3 years ago

I see and understand. This could happen, but with a flag/mode.

Toilal commented 3 years ago

@nachocho Does 131 Min stand for the duration of the media?

Toilal commented 3 years ago

In fact, frame_rate already exists, but it doesn't support .000, nor fps separated from the number by a space. I'll fix it.
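As a rough illustration of the two cases mentioned here (a fractional part such as .000 or .976, and a space before fps), the following sketch uses a plain regular expression. The pattern and the name `find_frame_rate` are assumptions made for this example; they are not guessit's own rule.

```python
import re

# Assumed pattern, not guessit's own: two or three digits, an optional
# fractional part of up to three digits, an optional single whitespace
# character, then "fps" in any letter case.
FRAME_RATE_RE = re.compile(
    r"(?<![\d.])(\d{2,3}(?:\.\d{1,3})?)\s?fps\b", re.IGNORECASE
)


def find_frame_rate(name):
    """Return the frame rate found in name as a float, or None."""
    match = FRAME_RATE_RE.search(name)
    return float(match.group(1)) if match else None


for name in ("Gladiator 23.976 FPS", "Gladiator 25.000 FPS", "Gladiator 24fps"):
    print(name, "->", find_frame_rate(name))
```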

nachocho commented 3 years ago

> @nachocho Does 131 Min stand for the duration of the media?

Yes, in this example:

guessit('Aliens DVD Silver Box Set 131 Min')

131 Min stands for the duration of 131 minutes. I do know that it is extremely difficult (if not impossible) to account for every single case out there. I would say the 131 Min can be disregarded if needed. The important thing here is to not treat this match as an episode of season 1 episode 31. Of course if you think also guessing the duration of the media is possible, even better.
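A duration like this can be recognized by its unit before any season/episode reading is attempted. The sketch below shows one possible way to do that with a regular expression; the pattern and the name `find_duration_minutes` are hypothetical and only meant to illustrate the idea.

```python
import re

# Assumed pattern: one to three digits, an optional single whitespace
# character, then "min", "mins", "minute" or "minutes" in any letter case.
DURATION_RE = re.compile(r"\b(\d{1,3})\s?min(?:utes?|s)?\b", re.IGNORECASE)


def find_duration_minutes(name):
    """Return the duration in minutes found in name, or None."""
    match = DURATION_RE.search(name)
    return int(match.group(1)) if match else None


print(find_duration_minutes("Aliens DVD Silver Box Set 131 Min"))
```

Once the number is claimed as a duration, it is no longer available to be read as season 1, episode 31.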

Toilal commented 3 years ago

I can add a pattern to guess this duration so this will not be guessed as season/episode anymore.

Gladiator.EXTENDED.2000.720.BrRip.264.YIFY
Gladiator.EXTENDED.2000.1080.BrRip.264.YIFY

Those cases are harder to solve... Maybe we could add screenSize patterns without the p when a year has already been guessed, but I have to check that this doesn't break other test cases.
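The idea of accepting a bare resolution only after a year has been found could be sketched roughly as follows. This is a standalone illustration, not guessit code: the set `BARE_HEIGHTS`, the year pattern, and the function `guess_year_and_screen_size` are all assumptions made for this example.

```python
import re

# Assumed set of heights that may appear without the trailing "p".
BARE_HEIGHTS = {"480", "576", "720", "1080", "2160"}
# Assumed year pattern: four digits starting with 19 or 20.
YEAR_RE = re.compile(r"^(?:19|20)\d{2}$")


def guess_year_and_screen_size(name):
    """Read a bare height as a screen size only once a year has been seen."""
    year = None
    screen_size = None
    for token in re.split(r"[.\s_-]+", name):
        if year is None and YEAR_RE.match(token):
            year = int(token)
        elif year is not None and screen_size is None and token in BARE_HEIGHTS:
            # A year came first, so treat the number as a resolution.
            screen_size = token + "p"
    return {"year": year, "screen_size": screen_size}


print(guess_year_and_screen_size("Gladiator.EXTENDED.2000.720.BrRip.264.YIFY"))
print(guess_year_and_screen_size("Gladiator.EXTENDED.2000.1080.BrRip.264.YIFY"))
```

Without a preceding year, a bare 720 is left alone in this sketch, so the compact season/episode reading would still be possible for names that follow that convention.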

nachocho commented 3 years ago

> I can add a pattern to guess this duration so this will not be guessed as season/episode anymore.
>
> Gladiator.EXTENDED.2000.720.BrRip.264.YIFY
> Gladiator.EXTENDED.2000.1080.BrRip.264.YIFY
>
> Those cases are harder to solve ... Maybe we could add screenSize patterns without p when year is already guessed, but I have to check if it doesn't break other test cases.

I agree the release info looks wrong, and the resolution should have a 'p' at the end. On the other hand, resolutions are pretty standard. I would say that a 1080 or 2160 is most likely a resolution rather than a season and episode, and should be treated as such regardless of where it is found. Season 10 ep 80 or season 21 ep 60 is really unlikely. Season 7 ep 20 could be more common, so if 720 is found, some other things may need to be considered.

This is just a thought, but of course you know better how to handle it.

Thanks.