Closed Toilal closed 8 years ago
I'm not a big fan of Rebulk syntax. It's OK for short matching rules but can get really unfriendly for longer ones. Also speaking of coupling, I think the chain is limiting the capabilities and introduce some kind of coupling.
This is a challenging project, I like the idea.
Thanks for your feedback @Diaoul ul, i'm happy you like the idea :)
I'm not sure the fluent API change anything to the coupling.
It's really only a matter of code style, and you can still use rebulk without chaining methods, like this:
rebulk = Rebulk()
rebulk.string("stringPattern")
rebulk.regex(r"reP\wttern")
rebulk.matches("input string")
But you may (only if you want to) write this, equivalent to the previous code block:
rebulk.string("strinPattern").regex(r"reP\wttern").matches("input string")
Of course, for guessit it's not possible to register all patterns in a single statement, but maybe we will use the fluent API to register different patterns for same properties, like S02E04
, 02x04
, 'S2015E04`...
I really want to avoid this kind of code :
self.container.register_property(None, r'(' + season_words_re.pattern + sep + '?(?P<season>' + digital_numeral + '(?:' + sep + '?' + all_separators_re.pattern + sep + '?' + digital_numeral + ')*)' + sep + '?' + season_words_re.pattern + '?)' + sep, confidence=1.0, formatter={None: parse_numeral, 'season': season_parser}, validator=ChainedValidator(DefaultValidator(), FormatterValidator('season', lambda x: len(x) > 1 if hasattr(x, '__len__') else False))) # Holly shit ! :p
So I hope some of the longer regular expression of guessit will be splitted in shorter ones with additional readable constraints/validation rules defined in rebulk. Currently, those rules are not implemented and I'm still thinking about the design of the feature, but i'm pretty sure it's possible to make things really easier.
Idea is to define some rules using flags defined on patterns (language
, subtitle prefix
, subtitle suffix
, year
, ...).
language
match is prefixed by subtitle prefix
match, then flag this match as a subtitleLanguage
matchlanguage
match is followed by subtitle suffix
match, then flag this match as a subtitleLanguage
matchlanguage
match is not surrounded by any other safe
match, then drop this match because it's probably in the title.year
match is not followed by another match, then drop this match (it's probably part of the title).year
matches, keep only the last one.There are many rules like these in guessit internals, but they are really hard to extract from reading code (I can't remember all of the rules ...). So with rebulk, i'm trying to create a system that could make them concrete and readable from code.
I've just release rebulk v0.2.1, and think this version is a good start to try implementing guessit 2. It will go into a new 2.x
branch.
Hi @Toilal,
first of all, sorry for not replying before, I wanted to write a proper reply rather than a simple answer, and I got sidetracked...
Let me start also by thanking you for all the work you have done already on Guessit, it wouldn't be where it is lately without you and for that you deserve a lot of praise.
That being said, I believe that people writing code should be the ones that decide on the future direction of a project, so if you feel that Guessit internals are keeping the project back and that a rewrite/complete refactoring is due, you have my blessing.
I think that it would be better to keep the project name, so that it doesn't create confusion for the users. So starting to work in a 2.x branch seems like the best course of action. That way, when the rewrite is at least as good as the original, it is just a matter of making a new 2.0 release and the upgrade will be seamless for users.
As for your specific points:
All transformers are executed in a predefined ordering. This ordering is tuned and can't be changed, because each transformer assume than previous transformers have performed with an expected state of the TreeMatcher object.
I agree with that, however there are some transformations that depend on some others to be able to work properly. The static order is of course not very extensible, I think the better way would be for each plugin to declare its dependencies, and then the matcher could compute a list of all plugins with their dependencies and execute them in the correct order.
The TreeMatcher doesn't support backtracking to revert a false positive guess
Backtracking would be really nice to have, that's for sure. As you said, it would avoid hacky constructs in the code (I don't like the 2nd pass either, but at the time this was the quick'n'easy fix and I never had the time to revisit it)
About rebulk, I haven't looked thoroughly into it, but having a library that kind-of extends regexps and allows for more powerful pattern matching seems like a good idea. By the way, here's a link about a regex construct that I read recently, and found it quite interesing, you might find it useful: http://www.rexegg.com/regex-best-trick.html
So please go ahead with the rewrite, I am really eager to see what you can come up with. I know I haven't been very active lately but I'm still here, and still love to see the project continuing to evolve. I believe that it really fills a niche and performs a very useful functionality that lots of media-related projects (media centers, subtitle downloaders, download managers, ...) can benefit from. The only thing that I would ask, but you already know this :smile:, is that you keep the set of unittests, as in the end, this is what brought the project where it is and what makes sure that it can only get better, and not regress. More tests = more coverage of all the bizarre constructs that the human brain can think of for naming things, and that ends up in better detection for more users as time goes by.
Full steam ahead!
Great :) Thanks for your support @wackou !
About regex, i'm now using regex module which is a dropin replacement for default re
module with additional features we need in guessit, like repeated captures (=> episode lists) and named lists (=> list of countries/languages).
It also supports a concurrent option that disables GIL, i'll try this later with a thread pool to see if it enhance global performance ...
Unfortunatelly, regex module doesn't compile on pypy, but do we really need pypy support ? Also I didn't try to pip install regex
on windows, hope it works ...
For unit tests, I plan to release 2.0-rc1
when all filenames from guessit1 are working in guessit2. Testing framework with yml files is a full rewrite, but still use the same file format as guessit with some enhancements. Rebulk is already at 100% line coverage, guessit will be also.
For migrating, i'm thinking about introducting a new single method API for guessit2, but also writing an adapter for quick migration (dropin replacement of guessit module).
Of course, no due date, it will probably take months, but here's the plan.
Sounds all great! :+1:
Alpha release soon :)
wow, that was fast :)
Any backward incompatible changes? I think I saw title renamed to episodeTitle or something?ç
But all test cases were migrated from guessit 1.x
to guessit 2.x
and are now passing, that's why I want to release the first 2.x
version now: It just works ! Output should be equivalent, except confidence that has been removed, and some particular cases that have been clarified/improved and overall results beeing more consistent.
For simplification purpose, guessit 2.x API was refactored and is not compatible with guessit 1.x.
There are some incompatible change : API changed, properties renamed, options removed ... Currently there's nothing ready to help developer to migrate, but I plan 2 things that will be available in later releases (beta):
For example, --name-only
was removed because guessit 2.x
automatically supports parsing files with or without extension. --filetype
option has been dropped too, because type
property guessing is performed at the very end of the process instead of the very beginning in guessit 1.x
. I think guessit 2.x
is easier to use and integrate than guessit 1.x
.
migration
module containing class to adapt API from guessit 2.x
to guessit 1.x
will be available. This adapter will implement the old methods from guessit 1.x
(guess_file_info
...), convert input options and rename properties from guessit 2.x
output to guessit 1.x
properties.Custom transformers implemented for guessit 1.x
requires a rewrite for 2.x
using a rebulk extension. Currently there's no API to register custom rebulk extensions, but there will be one later.
So after this alpha release, i'll start working on flexget guessit 2.x
migration to make sure everything is available for other apps developers to perform this migration.
When starting the 2.x
branch, I though it will increase performance, but sadly it seems to be quite the same as guessit 1.x
. I still need to perform real benchmarks to compare, but unit tests seems to take about the same time to run.
Documentation is sufficient to migrate. I won't bother with a migration module. The change from series to title is a little odd to me. You seem to have changed the point of view from episode to series. So now title is the series title. Why this change? From a pure file point of view the filename represents an episode, not a series. What's the rational behind that change?
About the type
property, we should keep episode
value like in guessit 1.x
you are right :+1:. I don't know why i've introduced this renaming, but it doesn't really make sense.
About the renaming from series
to title
and title
to episode
title, it's mainly related to the fact that when we guess the title
property in guessit.properties.title module, we still don't know if it's an episode
or movie
type
. The type
guessing is performed at the end of the process, with type post_processor.
So it would be possible to keep the property names from guessit 1.x
by renaming properties after we have found the file type
at the end, but it introduce some confusion in guessit codebase because rules using title
property in rules may refer to series
or title
as the result property ...
Also one other change is about episodeList
support. In guessit 1.x, there's episodeNumber
and episodeList
properties, but in 2.x there's only episodeNumber
and it can contain a single int
value, or a list[int]
. (same for season).
So the global question is, please give answer @wackou and @Diaoul : should we ...
2.x
so that names are same as guessit 1.x
? 2.x
?PS: We have already discussed this in https://github.com/wackou/guessit/issues/91, but maybe your opinion has changed with this rewrite ...
Good news for the type :+1:
About series
/title
/episodeTitle
now I get the reason. If we leave implementation details aside, I think this needs to be documented to avoid confusion at least. Both are valid options, it is all about the point of view.
As a user of the library I'd prefer to have series
or show_name
rather than title
. episode
rather than episodeNumber
. The shorter the better. Lower case with underscore rather than camelCase for pythonic reasons.
About the numbering when you have multiple subtitles, I'm OK with the new way, I think it's more pythonic.
Ok so I also rename all camelCase properties to underscore_case, think it's better too for a python lib.
I saw you put some work in the CLI. I suggest click as a replacement for anything you currently have. I'm using it for subliminal and it's really nice.
Thanks ! I didn't heard about click before. It seems great, but i'm not sure I want to add an extra dependency to handle command line arguments.
I get your point and that's why I was reluctant to use it. But dependencies are managed by pip and the features outweight this IMO (solves unicode handling, terminal encoding issue). However, guessit has little use of the CLI compared to subliminal.
If it solves unicode issues with various terminal and OS, I think i'll do the migration :). Thanks for pointing this out !
Working alpha is there ! pip install guessit --pre
to install :)
RC1 is released ! I think i'm other with it. I've performed migration for flexget, and it seems to work like a charm :)
I've just moved 2.x
to master
, and master
to 1.x
I'm one of the SickRage dev's. I'm doing the prelim work on switching us over to using GuessIt 2 as our primary parser. Couple of things I've noticed is release group matching is a little too greedy at times.
In addition, I would also like to request that you add XXX to your detections so we can filter out adult content for those that have their libraries directly updated by SickRage.
Thanks for all the effort put into this though as it will make SickRage much more capable in the near future.
I'll create a new issue for XXX recognition and will implement it before 2.0 stable release. Could you create another issue with examples of greedies release groups ? It would help to have a finer behavior.
Sure, I'll try to run a test this weekend to provide more comprehensive examples when I create the issue.
A simple quick example parsing the RARBG RSS feed right now:
UAV[rartv]
instead of UAV
RARBG
instead of None
BRISK[rartv]
instead of BRISK
And while these are fairly minor, I've seen some more insidious ones in the past that might be more difficult such as 720p being tacked on to the release group (the name left out a separator between the two).
It's there now ! I've just released guessit 2.0.0 (final). Thank you all for advices and feedbacks, it really helped to make this version better :)
That was the right move. I can't say much about guessit 1.x since I started when 2.0 was already released. But I'm glad @wackou created such nice tool.
I must say that rebulk API is really nice and powerful. I was just new to python and I could work and contribute with these two great projects: guessit and subliminal
It was really fast for me to read guessit code and understand what it was doing. Run the patterns, execute the rules, output the result. @Toilal you have my respect. Rebulk and guessit are awesome.
And the best way to learn python is creating a subliminal provider. Subliminal code is one of the cleanest that I've seen. @Diaoul the code master.
Just wanted to share a positive feedback.
I'm very pleased to hear such compliment. I have to thank you back too for all issues you were able to cleanly fix too, guessit is alive and here to stay for a long time thanks to you :)
Guessit is a great filename parser. It's already in use in some tools, like subliminal and Flexget, and some other well known projects like SickRage are planning to embed it in their application. I think it's now time to release guessit v1, and go ahead to next major version guessit v2.
From my (biased) point of view, it the best release name parser available. It has strengths like
But I think there are some pitfalls in guessit internals, and it's now really hard to make it better and fix remaining ssues.
For guessit v2, I plan to rewrite it from scratch. It may seems a huge work, but i've started to work on a lower level component called rebulk to register various patterns in bulk. This component already supports patterns matching in bulk, and will support some kind of rule engine to easily implements validation rules between registered patterns (ie. A regexp match must be next to another to be validated, etc ...). There's already some features required by guessit, like match formatters.
I want to make guessit2 a configuration module for rebulk, to make patterns and validation rules readable and easy to maintain. There will be no transformer anymore, only a set of patterns, processors and validation rules registered in a
Rebulk
object, and rebulk engine will do the rest.@wackou as owner of this repository and initial author, are you ok with this plan ? Do you prefer v2 to be in this repository, under a new branch, or do you want me to work on another new project ? (Actual codebase could be maintained in a 1.x branch).