guessit-io / guessit

GuessIt is a python library that extracts as much information as possible from a video filename.
https://guessit-io.github.io/guessit
GNU Lesser General Public License v3.0
821 stars 91 forks source link

Use Aho-Corasick algorithm to enhance performance #389

Open Toilal opened 7 years ago

Toilal commented 7 years ago

Guessit could use Aho-Corasick algorithm to enhance performance.

I think it could be implemented into rebulk, by merging all string patterns in a single Aho-Corasick pass.

Toilal commented 7 years ago

Just tested with pyahocorasick, but it makes things a bit slower when running guessit unit tests :(

See rebulk branch for implementation : https://github.com/Toilal/rebulk/pull/9

ratoaq2 commented 7 years ago

Are you profiling guessit?

I have a scenario with 1000 release names and profiling it using cProfile/snakeviz

For the current development version and those 1000 release names I get:

From the graph in snakeviz I would say that string search is just a small slice. Maybe worth testing it with a huge expected_title list.

The biggest part in match_patterns is related to chains (most likely the episodes patterns) and 1/4 of the part in execute_rules is related to toposort.

Maybe some improvements could be done when using guessit as an API (like an application context, where rules could be already sorted. It's just an example)

And some research around regexes could be done as well. Check if we could benefit from some algorithm that compiles all regexes to an automata (NFA / DFA) and just iterate through the input string once. But the profiling results should always be driving it, so we can focus on areas were improvements can be noticeable.

Toilal commented 7 years ago

Again, you are perfectly right :)

I though string search was a bottleneck, but it isn't and I should have used a profiler. I'm more used to Java profiler though, but i'll give a try to snakeviz.

It would be great to find something that compiles all regexes in an automata like Ago-Corasick, but I didn't find any library doing this ... And I'm not an algorithm expert :).

Maybe we could implement our own kind of simple pattern matching that could be faster than regex (maybe ?) ... Most regex are really simple and just there to replace the separators ...

I also found this post that may be valuable to optimize some regex : .

I'll install this snakeviz and start analyze, but it would be great to know exactly which regex consumes too much time. Maybe I could implement a "performance analysis" mode in rebulk that would display performance for each pattern ...

ratoaq2 commented 7 years ago

I'm a java guy as well that uses Yourkit to profile applications.

I only used python profiling one time. You don't need to install anything, just run your script using -m cProfile:

python -m cProfile -o report.prof myscript.py

Then you could install snakeviz to take a look at the results (report.prof). I tried line_profiler as well and it works quite well when you need to understand the performance of a certain method.

And I think rebulk could capture statistics as well... So you can evaluate how many times and how long it took for each pattern to be evaluated. And that could be applied to the rules as well.

milahu commented 7 months ago

I have a scenario with 1000 release names

6 million input strings here guessit makes my script about 100x slower... not usable, this turns my script into cpuburn.py

maybe a tree-sitter parser would be faster, at least that would be native code

https://github.com/guessit-io/guessit/issues/347#issuecomment-252488975

Guessit is not designed to be fast. I made the choice to make it extendable

nice idea, but not usable to parse millions of strings


i need this for full text search over release titles, ignoring all other fields

my inputs are: title, year, release_name and i only need parsed.title, so im using the shortcut

release = "Dont.Look.Up.2009.BDRip.XviD-FRAGMENT"
title = "Don't Look Up"
year = 2009
#release_title = guessit.guessit(release)["title"] # too slow
release_title = release.split(str(year), 1)[0] # Dont.Look.Up.