Toilal opened 7 years ago
Just tested with pyahocorasick, but it makes things a bit slower when running guessit unit tests :(
See the rebulk branch for the implementation: https://github.com/Toilal/rebulk/pull/9
Are you profiling guessit?
I have a scenario with 1000 release names and I'm profiling it using cProfile/snakeviz.
For the current development version and those 1000 release names, the snakeviz graph shows that string search is just a small slice. Maybe it's worth testing with a huge expected_title list.
The biggest part in match_patterns is related to chains (most likely the episodes patterns), and 1/4 of the time in execute_rules is related to toposort.
Maybe some improvements could be made when using guessit as an API (for instance an application context where the rules are already sorted; it's just an example).
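As a hedged sketch of that idea (the rule objects, the dependency mapping and apply() below are illustrative names, not guessit's or rebulk's real API), the toposort call could run once at startup instead of on every execute_rules call:

```python
from toposort import toposort_flatten  # the library behind rebulk's rule sorting

class ApplicationContext:
    """Hypothetical context that sorts the rules once and reuses the result."""

    def __init__(self, rule_dependencies):
        # rule_dependencies maps each rule to the set of rules it depends on;
        # sort=False because rule objects are hashable but not orderable
        self.sorted_rules = toposort_flatten(rule_dependencies, sort=False)

    def execute_rules(self, matches):
        for rule in self.sorted_rules:  # already sorted, no toposort per call
            rule.apply(matches)
```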
And some research around regexes could be done as well: check whether we could benefit from an algorithm that compiles all the regexes into an automaton (NFA/DFA) and iterates through the input string just once. But the profiling results should always drive it, so we can focus on areas where improvements will be noticeable.
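A cheap approximation of that single-pass idea, as a minimal sketch (the pattern names and regexes are illustrative): Python's re does not build a true DFA, but merging several patterns into one alternation with named groups at least scans the input once instead of once per pattern.

```python
import re

PATTERNS = {
    "screen_size": r"\b(?:720|1080|2160)p\b",
    "video_codec": r"\b(?:XviD|[xh]26[45])\b",
}

# One combined pattern; group names must be unique, valid identifiers.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items()),
    re.IGNORECASE,
)

for m in COMBINED.finditer("Dont.Look.Up.2009.1080p.BDRip.x264-FRAGMENT"):
    print(m.lastgroup, m.group())  # -> screen_size 1080p, video_codec x264
```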
Again, you are perfectly right :)
I thought string search was a bottleneck, but it isn't, and I should have used a profiler. I'm more used to Java profilers, but I'll give snakeviz a try.
It would be great to find something that compiles all regexes into an automaton, like Aho-Corasick does for literal strings, but I didn't find any library doing this... And I'm not an algorithm expert :).
Maybe we could implement our own simple kind of pattern matching that could be faster than regexes (maybe?)... Most of the regexes are really simple and are only there to handle the separators...
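For illustration, a hedged sketch of what such simpler matching could look like, assuming the patterns are plain words and the regexes mostly just absorb the '.', '-', '_' and space separators:

```python
import re

# Normalize separators once, then use plain substring search per literal
# pattern instead of running one regex per pattern over the raw string.
SEPARATORS = re.compile(r"[.\-_ ]+")

def find_literal(haystack, needles):
    """Return the needles found in haystack after separator normalization."""
    normalized = SEPARATORS.sub(" ", haystack).lower()
    return [needle for needle in needles if needle.lower() in normalized]

print(find_literal("Dont.Look.Up.2009.BDRip.XviD-FRAGMENT", ["BDRip", "XviD"]))
# ['BDRip', 'XviD']
```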
I also found a post that may be valuable for optimizing some regexes.
I'll install snakeviz and start analyzing, but it would be great to know exactly which regexes consume too much time. Maybe I could implement a "performance analysis" mode in rebulk that would display performance figures for each pattern...
I'm a Java guy as well, one who uses YourKit to profile applications. I've only used Python profiling once.
You don't need to install anything for the profiling itself; just run your script with -m cProfile:

```
python -m cProfile -o report.prof myscript.py
```

Then you can install snakeviz and open the results with snakeviz report.prof.
I tried line_profiler as well, and it works quite well when you need to understand the performance of a particular method.
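For example, a minimal sketch (parse_all and myscript.py are illustrative names):

```python
# kernprof injects `profile` as a builtin, so no import is needed.

@profile
def parse_all(names):
    import guessit
    return [guessit.guessit(name) for name in names]

if __name__ == "__main__":
    parse_all(["Dont.Look.Up.2009.BDRip.XviD-FRAGMENT"])

# Run with:
#   kernprof -l -v myscript.py
# which prints a line-by-line timing report for the decorated function.
```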
And I think rebulk could capture statistics as well, so you could evaluate how many times each pattern was evaluated and how long it took. The same could be applied to the rules.
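A hypothetical wrapper along those lines (pattern.matches() and pattern.name are illustrative names, not rebulk's actual API) could look like this:

```python
import time
from collections import defaultdict

stats = defaultdict(lambda: {"calls": 0, "total": 0.0})

def timed_matches(pattern, input_string):
    """Run a pattern and accumulate call count and total time per pattern."""
    start = time.perf_counter()
    result = pattern.matches(input_string)
    entry = stats[pattern.name]
    entry["calls"] += 1
    entry["total"] += time.perf_counter() - start
    return result

def report():
    # Slowest patterns first
    for name, entry in sorted(stats.items(), key=lambda kv: -kv[1]["total"]):
        print(f"{name}: {entry['calls']} calls, {entry['total']:.3f}s")
```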
> I have a scenario with 1000 release names
With 6 million input strings here, guessit makes my script about 100x slower... not usable; this turns my script into cpuburn.py.
Maybe a tree-sitter parser would be faster; at least that would be native code.
https://github.com/guessit-io/guessit/issues/347#issuecomment-252488975

> Guessit is not designed to be fast. I made the choice to make it extendable.
Nice idea, but not usable for parsing millions of strings. I need this for full-text search over release titles, ignoring all other fields. My inputs are title, year and release_name, and I only need parsed.title, so I'm using this shortcut:
```python
release = "Dont.Look.Up.2009.BDRip.XviD-FRAGMENT"
title = "Don't Look Up"
year = 2009
# release_title = guessit.guessit(release)["title"]  # too slow
release_title = release.split(str(year), 1)[0]  # "Dont.Look.Up."
```
Guessit could use the Aho-Corasick algorithm to enhance performance. I think it could be implemented in rebulk by merging all string patterns into a single Aho-Corasick pass.
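A minimal sketch using the pyahocorasick package (the pattern list is illustrative): all literal patterns are merged into one automaton, so each input string is scanned exactly once for all of them.

```python
import ahocorasick  # pip install pyahocorasick

PATTERNS = ["BDRip", "XviD", "x264", "1080p"]

# Build one automaton holding all literal patterns.
automaton = ahocorasick.Automaton()
for index, word in enumerate(PATTERNS):
    automaton.add_word(word.lower(), (index, word))
automaton.make_automaton()

release = "Dont.Look.Up.2009.BDRip.XviD-FRAGMENT"
# One pass over the input yields every pattern occurrence.
for end_index, (index, word) in automaton.iter(release.lower()):
    start_index = end_index - len(word) + 1
    print(word, start_index, end_index)
```

Note that Aho-Corasick only handles literal strings; the regex-based patterns would still need their own pass.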