guessit-io / guessit

GuessIt is a python library that extracts as much information as possible from a video filename.
https://guessit-io.github.io/guessit
GNU Lesser General Public License v3.0

release_group matching is too greedy. #248

Closed labrys closed 8 years ago

labrys commented 8 years ago

A simple quick example parsing the RARBG RSS feed right now:

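import requests
from bs4 import BeautifulSoup
from guessit import guessit

# Fetch the RSS feed; a Response is truthy only when its status code is below 400.
feed = requests.get('http://rarbg.com/rss.php')
if feed:
    soup = BeautifulSoup(feed.content, 'html.parser')
    # Collect the distinct release_group guesses across all item titles.
    groups = {guessit(str(g.title.string)).get('release_group') for g in soup.rss.channel.find_all('item')}
    print(groups)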

spamTV[rartv] instead of spamTV
RARBG instead of None
UAV[rartv] instead of UAV
BRISK[rartv] instead of BRISK

And while these are fairly minor, I've seen some more insidious ones in the past that might be more difficult to handle, such as 720p being tacked on to the release group (the name left out a separator between the two).

Edited: As @rarbg mentioned, RARBG was matched correctly.

rarbg commented 8 years ago

"RARBG instead of None" < that should be the RARBG release group :P I agree with the rest.

labrys commented 8 years ago

@rarbg doh you're right! Wasn't thinking.

Toilal commented 8 years ago

The problem is that in some other contexts, [...] may be part of the release group name. We have already discussed this particular point and decided to keep those release groups as is.

It's still possible (and easy) to post-process the release_group value to remove the [...] part if you need that in your context.
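A minimal sketch of that post-processing, assuming the unwanted part is always a single bracketed tag at the very end of the value. The helper name `strip_trailing_tag` is hypothetical and not part of guessit; it only operates on the string you get back from the guess.

```python
import re

# Hypothetical helper, not part of guessit: matches one "[...]" tag at the end.
_TRAILING_TAG = re.compile(r"\s*\[[^\[\]]*\]\s*$")


def strip_trailing_tag(release_group):
    """Return release_group without a trailing bracketed tag such as "[rartv]"."""
    if not release_group:
        # None or empty string: nothing to strip.
        return release_group
    stripped = _TRAILING_TAG.sub("", release_group)
    # If the whole value was a bracketed tag, keep it rather than return "".
    return stripped or release_group


for value in ("spamTV[rartv]", "UAV[rartv]", "BRISK[rartv]", "RARBG"):
    print(value, "->", strip_trailing_tag(value))
```

Because the bracketed part can legitimately belong to a group name in other contexts, this is something to apply only where you know the source tags its names this way.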

labrys commented 8 years ago

@Toilal I agree with that. What was of more concern was when things like 720p get attached to the release group. There were some others I had noticed when I first started the testing, but that was before the holidays, so other than the instance of 720p I can't remember which ones. I'll be continuing some testing, and I'll be sure to report any others I note. As of right now we already strip known things like [rartv], so those shouldn't really be an issue.

Toilal commented 8 years ago

It would be great if you could recall those failing examples :)

labrys commented 8 years ago

Right now it's my memory that's failing :grinning: But IIRC, they were generally all due to poorly named files, such as names missing separators between fields.

rarbg commented 8 years ago

@Toilal @labrys I have some ideas of what might fail, so I'll just give you some examples to run against guessit (I won't do it because I'm lazy :P)

TEST.S01E02.2160p.NF.WEBRip.x264.DD5.1-ABC
TEST.2015.12.30.720p.WEBRip.h264-ABC
TEST.S01E10.24.1080p.NF.WEBRip.AAC2.0.x264-ABC
TEST.S01E10.24.1080p.NF.WEBRip.AAC.2.0.x264-ABC
TEST.S05E02.720p.iP.WEBRip.AAC2.0.H264-ABC
TEST.S03E07.720p.WEBRip.AAC2.0.x264-ABC
TEST.S15E15.24.1080p.FREE.WEBRip.AAC2.0.x264-ABC
TEST.S11E11.24.720p.ETV.WEBRip.AAC2.0.x264-ABC
TEST.2015.1080p.HC.WEBRip.x264.AAC2.0-ABC
TEST.2015.1080p.3D.BluRay.Half-SBS.x264.DTS-HD.MA.7.1-ABC
TEST.2015.1080p.3D.BluRay.Half-OU.x264.DTS-HD.MA.7.1-ABC
TEST.2015.1080p.3D.BluRay.Half-OU.x264.DTS-HD.MA.TrueHD.7.1.Atmos-ABC
TEST.2015.1080p.3D.BluRay.Half-SBS.x264.DTS-HD.MA.TrueHD.7.1.Atmos-ABC
TEST.2015.1080p.BluRay.REMUX.AVC.DTS-HD.MA.TrueHD.7.1.Atmos-ABC

:"D

labrys commented 8 years ago

@rarbg The last three were detected with release group Atmos-ABC; the others were correctly detected as ABC.
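A check like this one can be scripted. The sketch below is one possible way, not how the results above were produced: `find_mismatches` is a hypothetical helper, and the guessing function is passed in as a parameter so the helper itself does not depend on guessit being installed.

```python
def find_mismatches(names, expected_group, guess):
    """Return {name: guessed_group} for names whose release_group differs.

    `guess` is any callable that takes a name and returns a dict-like
    result, for example the `guessit` function imported from guessit.
    """
    mismatches = {}
    for name in names:
        guessed = guess(name).get("release_group")
        if guessed != expected_group:
            mismatches[name] = guessed
    return mismatches


# Usage, assuming guessit is installed:
#   from guessit import guessit
#   names = ["TEST.2015.1080p.HC.WEBRip.x264.AAC2.0-ABC",
#            "TEST.2015.1080p.BluRay.REMUX.AVC.DTS-HD.MA.TrueHD.7.1.Atmos-ABC"]
#   print(find_mismatches(names, "ABC", guessit))
```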

Toilal commented 8 years ago

How about 720p/1080p/2160p? Is everything OK?

labrys commented 8 years ago

Yup, no other false matches. I started logging guesses from an RSS stream this morning, after I get a decent sampling of results (maybe a day or two's worth) I'll analyze and report back.

Toilal commented 8 years ago

So if Atmos does not belong to release_group, what does it mean? Is it related to TrueHD 7.1?

labrys commented 8 years ago

http://www.dolby.com/us/en/brands/dolby-atmos.html

Toilal commented 8 years ago

Thx! I've created another issue, #249; fixing that one will fix these too.

labrys commented 8 years ago

Also, not sure if you are trying to match or not, but Half-SBS and Half-OU weren't detected either.

rarbg commented 8 years ago

@Toilal @labrys Of course, my examples weren't only for group names :P BTW, about #249: my TrueHD 7.1 Atmos examples should also be detected as dual codec:

TEST.2015.1080p.3D.BluRay.Half-OU.x264.DTS-HD.MA.TrueHD.7.1.Atmos-ABC
TEST.2015.1080p.3D.BluRay.Half-SBS.x264.DTS-HD.MA.TrueHD.7.1.Atmos-ABC
TEST.2015.1080p.BluRay.REMUX.AVC.DTS-HD.MA.TrueHD.7.1.Atmos-ABC

DTS-HD Master Audio 7.1 + TrueHD Atmos 7.1

Toilal commented 8 years ago

I'm really happy with the behavior of guessit 2 with all those examples. It detects both audio_codec values like a charm where guessit 1 would have guessed only one.

For: TEST.2015.1080p.3D.BluRay.Half-SBS.x264.DTS-HD.MA.TrueHD.7.1.Atmos-ABC
GuessIt found: {
    "title": "TEST",
    "year": 2015,
    "screen_size": "1080p",
    "other": "3D",
    "format": "BluRay",
    "video_codec": "h264",
    "audio_codec": [
        "DTS",
        "TrueHD"
    ],
    "audio_profile": "HDMA",
    "audio_channels": "7.1",
    "release_group": "Atmos-ABC",
    "type": "movie"
}

Could you tell me what Half-OU and Half-SBS mean?

rarbg commented 8 years ago

@Toilal For: TEST.2015.1080p.3D.BluRay.Half-SBS.x264.DTS-HD.MA.TrueHD.7.1.Atmos-ABC
There are 3 codecs in this one: DTS, DTS-HD Master Audio, and TrueHD Atmos.

Half-SBS / Half-OU are 3D formats: side-by-side multiplexing (Half-SBS) or top-bottom multiplexing (Half-OU).
Half-SBS = http://imagecurl.com/viewer.php?file=93070895678710446384.png
Half-OU = http://imagecurl.com/viewer.php?file=71497653032519412231.png
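For anyone who needs these two tags before guessit handles them, they can be picked out of a name separately. This is only a sketch: `detect_3d_layout` is a hypothetical helper, and the accepted spellings (Half-SBS, Half.SBS, HalfSBS, and the same for OU) are an assumption based on the names in this thread.

```python
import re

# Assumed spellings: "Half" + optional "-", "." or space + "SBS" or "OU",
# not glued to other letters or digits on either side.
_LAYOUT_3D = re.compile(
    r"(?<![A-Za-z0-9])Half[-. ]?(SBS|OU)(?![A-Za-z0-9])", re.IGNORECASE
)

_LAYOUT_NAMES = {"SBS": "side-by-side", "OU": "top-bottom"}


def detect_3d_layout(name):
    """Return "side-by-side", "top-bottom", or None if no tag is found."""
    match = _LAYOUT_3D.search(name)
    if match is None:
        return None
    return _LAYOUT_NAMES[match.group(1).upper()]


print(detect_3d_layout("TEST.2015.1080p.3D.BluRay.Half-SBS.x264.DTS-HD.MA.7.1-ABC"))
print(detect_3d_layout("TEST.2015.1080p.3D.BluRay.Half-OU.x264.DTS-HD.MA.7.1-ABC"))
```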

Toilal commented 8 years ago

Guessit will grab DTS, TrueHD and DolbyAtmos as audio_codec, and HDMA as audio_profile. I think that's quite good, and it will be hard to implement your use case exactly without breaking other test cases.

Maybe DolbyAtmos fits better in audio_profile than in audio_codec?

Toilal commented 8 years ago

Or maybe we could add an audio_surround property for DolbyDigital and DolbyAtmos values?

rarbg commented 8 years ago

audio_surround is confusing because you can have a lot of codecs with surround, but I don't really use guessit, so it's up to your users to decide :)

labrys commented 8 years ago

@Toilal I think Atmos fits better in audio_profile than audio_codec, as it can be applied to multiple codecs. I think audio_surround would create another property that's not really needed. Between codec and channels you pretty much know what's surround so I don't see a large number of tags that would need this property.

However, an option for consideration is to modify the use of the other category. Instead of only having other properties -> other, extend the usage of other to all sections. Thus you could have audio_other and move properties such as DualAudio and AudioFix to that section. The same could apply to tags such as WideScreen, Netflix, Half-SBS, and Screener, which could be moved to video_other, etc.

This would allow you to test only for those tags in the other properties category that really don't fall in the primary categories. other properties -> other could then be used for tags that really don't fit in the other categories, or for which the category isn't apparent from the usage within the name. E.g. HD, which could belong in either audio properties or video properties.

Also with other sections for each category, when enough other tags are generated to merit their own property, you can create that property and migrate them. Backwards compatibility could be maintained through a union of the <category>_other property and the new properties based on the requested guessit version.
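To make the proposal concrete, here is a rough sketch of the split and of the union used for backwards compatibility. Nothing here is guessit API: the `CATEGORY_OF` mapping, `split_other`, and `legacy_other` are hypothetical names, and the tag assignments simply follow the examples given above.

```python
# Hypothetical mapping from an "other" tag to a per-category property.
CATEGORY_OF = {
    "DualAudio": "audio_other",
    "AudioFix": "audio_other",
    "WideScreen": "video_other",
    "Netflix": "video_other",
    "Half-SBS": "video_other",
    "Screener": "video_other",
}


def split_other(guess):
    """Return a copy of guess with "other" values moved to per-category lists."""
    result = {key: value for key, value in guess.items() if key != "other"}
    values = guess.get("other", [])
    if isinstance(values, str):
        values = [values]  # a single value may be given as a plain string
    for value in values:
        # Tags without a known category stay in the generic "other" list.
        key = CATEGORY_OF.get(value, "other")
        result.setdefault(key, []).append(value)
    return result


def legacy_other(split_guess):
    """Rebuild the flat "other" list as the union of all *_other lists.

    Expects the output of split_other, where every such value is a list.
    """
    merged = []
    for key, values in split_guess.items():
        if key == "other" or key.endswith("_other"):
            for value in values:
                if value not in merged:
                    merged.append(value)
    return merged


split = split_other({"title": "TEST", "other": ["DualAudio", "WideScreen", "Proper"]})
print(split)
print(legacy_other(split))
```

A caller asking for the older layout would read `legacy_other(...)`, while newer callers read the per-category lists directly.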

labrys commented 8 years ago

Also, while parsing the test results so far, I noticed that mimetype and alternativeTitle aren't listed in your documentation. And shouldn't alternativeTitle be alternative_title?

And since you have mimetype shouldn't you also have clowntype? :grinning:

labrys commented 8 years ago

Release group results so far:

@rarbg If you see any you think are miscategorized please let me know.

Edited: Moved HDMI to release groups.

rarbg commented 8 years ago

@labrys HDMI is a release group

Toilal commented 8 years ago

@labrys Could you provide the whole filename for Teacher, SPLIT SCENES and 320 kbps ?

Also what's wrong with SPARKSs, CiNEFiLEs, NODLABSs, DAAs ? What does the ending s mean and could you provide the whole filename too ?

I'll add the new property audio_bitrate for 320 kbps. (see #251)
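Until such a property exists, a bitrate tag of this shape can be extracted with a small regex. This is a sketch under assumptions: `find_audio_bitrate` is a hypothetical helper, it only recognises a 2 to 4 digit number followed by "kbps", and returning an integer is a choice made here, not necessarily how guessit would represent the value.

```python
import re

# Assumed shape: 2-4 digits, optional space, "kbps" (any case), not glued
# to surrounding letters or digits.
_BITRATE = re.compile(
    r"(?<![A-Za-z0-9])(\d{2,4})\s?kbps(?![A-Za-z0-9])", re.IGNORECASE
)


def find_audio_bitrate(name):
    """Return the bitrate in kbps as an int, or None if no tag is found."""
    match = _BITRATE.search(name)
    return int(match.group(1)) if match else None


print(find_audio_bitrate("VA - Trance Desire Volume 59 (2016) MP3 [320 kbps]"))
```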

Toilal commented 8 years ago

and SPLIT SCENES will be guessed as other = Bonus

Toilal commented 8 years ago

But thank you for all those examples and feedback on guessit, it really helps! Feel free to add as many failing examples as you can to the GitHub issue tracker; I'll try to fix them and add them to the unit tests to guard against regressions in future versions.

Toilal commented 8 years ago

I'm marking this particular issue as won't fix, because the original issue was about the greedy release_group values, like BRISK[rartv] and so on. As those cases are too context dependent, I'll let guessit keep guessing like this, and it's up to the user to split the release_group again if needed.

But I've created dedicated issues for all the other issues we have discussed in this thread.

labrys commented 8 years ago

I'll continue running the tests and provide additional feedback. I'm fine with this issue being closed for now. I haven't run the stats but a rough guess would be that over 95% guess correctly which is pretty impressive, so great work :+1:! Also thanks to @rarbg for your input and for the scrapes :grinning: I'll provide final stats after this test has run for a while and information on other properties that don't match correctly. I'd also like to add scrapes from other providers to the list, time permitting. In the meantime here's the requested information:

My TS Teacher 2015 WEB-DL

Looks like the 320 kbps entries came from some audio releases that got included in the list; here is one:

VA - Trance Desire Volume 59 (2016) MP3 [320 kbps]

Looks like the excess s on the end of the release groups is just how they labeled it, even though it's not part of the release group's name. Nothing you can do about that. Examples:

The.Imaginarium.Of.Doctor.Parnassus.2009.1080p.BluRay.x264-CiNEFiLEs
The.Lone.Ranger.2013.1080p.BluRay.x264-DAAs