guessit-io / guessit

GuessIt is a python library that extracts as much information as possible from a video filename.
https://guessit-io.github.io/guessit
GNU Lesser General Public License v3.0
822 stars 91 forks source link

Guessit 2 #232

Closed Toilal closed 8 years ago

Toilal commented 9 years ago

Guessit is a great filename parser. It's already in use in some tools, like subliminal and Flexget, and some other well known projects like SickRage are planning to embed it in their application. I think it's now time to release guessit v1, and go ahead to next major version guessit v2.

From my (biased) point of view, it the best release name parser available. It has strengths like

But I think there are some pitfalls in guessit internals, and it's now really hard to make it better and fix remaining ssues.

For guessit v2, I plan to rewrite it from scratch. It may seems a huge work, but i've started to work on a lower level component called rebulk to register various patterns in bulk. This component already supports patterns matching in bulk, and will support some kind of rule engine to easily implements validation rules between registered patterns (ie. A regexp match must be next to another to be validated, etc ...). There's already some features required by guessit, like match formatters.

I want to make guessit2 a configuration module for rebulk, to make patterns and validation rules readable and easy to maintain. There will be no transformer anymore, only a set of patterns, processors and validation rules registered in a Rebulk object, and rebulk engine will do the rest.

@wackou as owner of this repository and initial author, are you ok with this plan ? Do you prefer v2 to be in this repository, under a new branch, or do you want me to work on another new project ? (Actual codebase could be maintained in a 1.x branch).

Diaoul commented 9 years ago

I'm not a big fan of Rebulk syntax. It's OK for short matching rules but can get really unfriendly for longer ones. Also speaking of coupling, I think the chain is limiting the capabilities and introduce some kind of coupling.

This is a challenging project, I like the idea.

Toilal commented 9 years ago

Thanks for your feedback @Diaoul ul, i'm happy you like the idea :)

I'm not sure the fluent API change anything to the coupling.

It's really only a matter of code style, and you can still use rebulk without chaining methods, like this:

rebulk = Rebulk()
rebulk.string("stringPattern")
rebulk.regex(r"reP\wttern")
rebulk.matches("input string")

But you may (only if you want to) write this, equivalent to the previous code block:

rebulk.string("strinPattern").regex(r"reP\wttern").matches("input string")

Of course, for guessit it's not possible to register all patterns in a single statement, but maybe we will use the fluent API to register different patterns for same properties, like S02E04, 02x04, 'S2015E04`...

I really want to avoid this kind of code :

self.container.register_property(None, r'(' + season_words_re.pattern + sep + '?(?P<season>' + digital_numeral + '(?:' + sep + '?' + all_separators_re.pattern + sep + '?' + digital_numeral + ')*)' + sep + '?' + season_words_re.pattern + '?)' + sep, confidence=1.0, formatter={None: parse_numeral, 'season': season_parser}, validator=ChainedValidator(DefaultValidator(), FormatterValidator('season', lambda x: len(x) > 1 if hasattr(x, '__len__') else False))) # Holly shit ! :p

So I hope some of the longer regular expression of guessit will be splitted in shorter ones with additional readable constraints/validation rules defined in rebulk. Currently, those rules are not implemented and I'm still thinking about the design of the feature, but i'm pretty sure it's possible to make things really easier.

Idea is to define some rules using flags defined on patterns (language, subtitle prefix, subtitle suffix, year, ...).

There are many rules like these in guessit internals, but they are really hard to extract from reading code (I can't remember all of the rules ...). So with rebulk, i'm trying to create a system that could make them concrete and readable from code.

Toilal commented 9 years ago

I've just release rebulk v0.2.1, and think this version is a good start to try implementing guessit 2. It will go into a new 2.x branch.

wackou commented 9 years ago

Hi @Toilal,

first of all, sorry for not replying before, I wanted to write a proper reply rather than a simple answer, and I got sidetracked...

Let me start also by thanking you for all the work you have done already on Guessit, it wouldn't be where it is lately without you and for that you deserve a lot of praise.

That being said, I believe that people writing code should be the ones that decide on the future direction of a project, so if you feel that Guessit internals are keeping the project back and that a rewrite/complete refactoring is due, you have my blessing.

I think that it would be better to keep the project name, so that it doesn't create confusion for the users. So starting to work in a 2.x branch seems like the best course of action. That way, when the rewrite is at least as good as the original, it is just a matter of making a new 2.0 release and the upgrade will be seamless for users.

As for your specific points:

All transformers are executed in a predefined ordering. This ordering is tuned and can't be changed, because each transformer assume than previous transformers have performed with an expected state of the TreeMatcher object.

I agree with that, however there are some transformations that depend on some others to be able to work properly. The static order is of course not very extensible, I think the better way would be for each plugin to declare its dependencies, and then the matcher could compute a list of all plugins with their dependencies and execute them in the correct order.

The TreeMatcher doesn't support backtracking to revert a false positive guess

Backtracking would be really nice to have, that's for sure. As you said, it would avoid hacky constructs in the code (I don't like the 2nd pass either, but at the time this was the quick'n'easy fix and I never had the time to revisit it)

About rebulk, I haven't looked thoroughly into it, but having a library that kind-of extends regexps and allows for more powerful pattern matching seems like a good idea. By the way, here's a link about a regex construct that I read recently, and found it quite interesing, you might find it useful: http://www.rexegg.com/regex-best-trick.html

So please go ahead with the rewrite, I am really eager to see what you can come up with. I know I haven't been very active lately but I'm still here, and still love to see the project continuing to evolve. I believe that it really fills a niche and performs a very useful functionality that lots of media-related projects (media centers, subtitle downloaders, download managers, ...) can benefit from. The only thing that I would ask, but you already know this :smile:, is that you keep the set of unittests, as in the end, this is what brought the project where it is and what makes sure that it can only get better, and not regress. More tests = more coverage of all the bizarre constructs that the human brain can think of for naming things, and that ends up in better detection for more users as time goes by.

Full steam ahead!

Toilal commented 9 years ago

Great :) Thanks for your support @wackou !

About regex, i'm now using regex module which is a dropin replacement for default re module with additional features we need in guessit, like repeated captures (=> episode lists) and named lists (=> list of countries/languages).

It also supports a concurrent option that disables GIL, i'll try this later with a thread pool to see if it enhance global performance ...

Unfortunatelly, regex module doesn't compile on pypy, but do we really need pypy support ? Also I didn't try to pip install regex on windows, hope it works ...

For unit tests, I plan to release 2.0-rc1 when all filenames from guessit1 are working in guessit2. Testing framework with yml files is a full rewrite, but still use the same file format as guessit with some enhancements. Rebulk is already at 100% line coverage, guessit will be also.

For migrating, i'm thinking about introducting a new single method API for guessit2, but also writing an adapter for quick migration (dropin replacement of guessit module).

Of course, no due date, it will probably take months, but here's the plan.

wackou commented 9 years ago

Sounds all great! :+1:

Toilal commented 9 years ago

Alpha release soon :)

wackou commented 9 years ago

wow, that was fast :)

Diaoul commented 9 years ago

Any backward incompatible changes? I think I saw title renamed to episodeTitle or something?ç

Toilal commented 9 years ago

But all test cases were migrated from guessit 1.x to guessit 2.x and are now passing, that's why I want to release the first 2.x version now: It just works ! Output should be equivalent, except confidence that has been removed, and some particular cases that have been clarified/improved and overall results beeing more consistent.

For simplification purpose, guessit 2.x API was refactored and is not compatible with guessit 1.x.

There are some incompatible change : API changed, properties renamed, options removed ... Currently there's nothing ready to help developer to migrate, but I plan 2 things that will be available in later releases (beta):

For example, --name-only was removed because guessit 2.x automatically supports parsing files with or without extension. --filetype option has been dropped too, because type property guessing is performed at the very end of the process instead of the very beginning in guessit 1.x. I think guessit 2.x is easier to use and integrate than guessit 1.x.

Custom transformers implemented for guessit 1.x requires a rewrite for 2.x using a rebulk extension. Currently there's no API to register custom rebulk extensions, but there will be one later.

So after this alpha release, i'll start working on flexget guessit 2.x migration to make sure everything is available for other apps developers to perform this migration.

When starting the 2.x branch, I though it will increase performance, but sadly it seems to be quite the same as guessit 1.x. I still need to perform real benchmarks to compare, but unit tests seems to take about the same time to run.

Diaoul commented 9 years ago

Documentation is sufficient to migrate. I won't bother with a migration module. The change from series to title is a little odd to me. You seem to have changed the point of view from episode to series. So now title is the series title. Why this change? From a pure file point of view the filename represents an episode, not a series. What's the rational behind that change?

Toilal commented 9 years ago

About the type property, we should keep episode value like in guessit 1.x you are right :+1:. I don't know why i've introduced this renaming, but it doesn't really make sense.

About the renaming from series to title and title to episode title, it's mainly related to the fact that when we guess the title property in guessit.properties.title module, we still don't know if it's an episode or movie type. The type guessing is performed at the end of the process, with type post_processor.

So it would be possible to keep the property names from guessit 1.x by renaming properties after we have found the file type at the end, but it introduce some confusion in guessit codebase because rules using title property in rules may refer to series or title as the result property ...

Also one other change is about episodeList support. In guessit 1.x, there's episodeNumber and episodeList properties, but in 2.x there's only episodeNumber and it can contain a single int value, or a list[int]. (same for season).

So the global question is, please give answer @wackou and @Diaoul : should we ...

PS: We have already discussed this in https://github.com/wackou/guessit/issues/91, but maybe your opinion has changed with this rewrite ...

Diaoul commented 9 years ago

Good news for the type :+1:

About series/title/episodeTitle now I get the reason. If we leave implementation details aside, I think this needs to be documented to avoid confusion at least. Both are valid options, it is all about the point of view. As a user of the library I'd prefer to have series or show_name rather than title. episode rather than episodeNumber. The shorter the better. Lower case with underscore rather than camelCase for pythonic reasons.

About the numbering when you have multiple subtitles, I'm OK with the new way, I think it's more pythonic.

Toilal commented 9 years ago

Ok so I also rename all camelCase properties to underscore_case, think it's better too for a python lib.

Diaoul commented 9 years ago

I saw you put some work in the CLI. I suggest click as a replacement for anything you currently have. I'm using it for subliminal and it's really nice.

Toilal commented 9 years ago

Thanks ! I didn't heard about click before. It seems great, but i'm not sure I want to add an extra dependency to handle command line arguments.

Diaoul commented 9 years ago

I get your point and that's why I was reluctant to use it. But dependencies are managed by pip and the features outweight this IMO (solves unicode handling, terminal encoding issue). However, guessit has little use of the CLI compared to subliminal.

Toilal commented 9 years ago

If it solves unicode issues with various terminal and OS, I think i'll do the migration :). Thanks for pointing this out !

Toilal commented 9 years ago

Working alpha is there ! pip install guessit --pre to install :)

Toilal commented 8 years ago

RC1 is released ! I think i'm other with it. I've performed migration for flexget, and it seems to work like a charm :)

Toilal commented 8 years ago

I've just moved 2.x to master, and master to 1.x

labrys commented 8 years ago

I'm one of the SickRage dev's. I'm doing the prelim work on switching us over to using GuessIt 2 as our primary parser. Couple of things I've noticed is release group matching is a little too greedy at times.

In addition, I would also like to request that you add XXX to your detections so we can filter out adult content for those that have their libraries directly updated by SickRage.

Thanks for all the effort put into this though as it will make SickRage much more capable in the near future.

Toilal commented 8 years ago

I'll create a new issue for XXX recognition and will implement it before 2.0 stable release. Could you create another issue with examples of greedies release groups ? It would help to have a finer behavior.

labrys commented 8 years ago

Sure, I'll try to run a test this weekend to provide more comprehensive examples when I create the issue.

A simple quick example parsing the RARBG RSS feed right now:

UAV[rartv] instead of UAV RARBG instead of None BRISK[rartv] instead of BRISK

And while these are fairly minor, I've seen some more insidious ones in the past that might be more difficult such as 720p being tacked on to the release group (the name left out a separator between the two).

Toilal commented 8 years ago

It's there now ! I've just released guessit 2.0.0 (final). Thank you all for advices and feedbacks, it really helped to make this version better :)

ratoaq2 commented 6 years ago

That was the right move. I can't say much about guessit 1.x since I started when 2.0 was already released. But I'm glad @wackou created such nice tool.

I must say that rebulk API is really nice and powerful. I was just new to python and I could work and contribute with these two great projects: guessit and subliminal

It was really fast for me to read guessit code and understand what it was doing. Run the patterns, execute the rules, output the result. @Toilal you have my respect. Rebulk and guessit are awesome.

And the best way to learn python is creating a subliminal provider. Subliminal code is one of the cleanest that I've seen. @Diaoul the code master.

Just wanted to share a positive feedback.

Toilal commented 6 years ago

I'm very pleased to hear such compliment. I have to thank you back too for all issues you were able to cleanly fix too, guessit is alive and here to stay for a long time thanks to you :)