intgr / topy

Python script to fix typos in text, based on the RegExTypoFix project from Wikipedia and AutoWikiBrowser
Other
35 stars 11 forks source link

add caching #5

Closed berendt closed 8 years ago

berendt commented 10 years ago

Loading the regs everytime doesn't make sense.

intgr commented 10 years ago

What kind of caching do you have in mind? I guess I could cache the intermediate parser output from BeautifulSoup, but that's a relatively small part compared to the time to compile regexes.

The regexes are compiled only once per execution. AFAICT there is no way to persist the compiled object across multiple runs. Even pickle.dumps() on a regex object has to compile again on load, so it's not a win.

Instead of caching BeautifulSoup output, I'd prefer switching to a faster parser instead so I could lose the dependency and it wouldn't be an issue at all.

berendt commented 10 years ago

Hello Marti.

I wrote a full rewrite of your script available at https://gist.github.com/berendt/5ae38f2f1d5bd6b883d3 at the moment and tested it with some OpenStack repositories. Please have a look if you want to replace the current version of Topy with this rewrite. It has clean logging an argument parsing and it uses regex instead of re. Using regex it's possible to compile all regular expressions. With Python2 I had > 500 failed compilations.

I used Pickle to write/read the compiled Regexs. Thought that it increased the loading time, but I have not measured it.

Christian.

Christian Berendt Cloud Computing Solution Architect Mail: berendt@b1-systems.de

B1 Systems GmbH Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537

intgr commented 10 years ago

I profiled it on my laptop and the total startup time is 2.1s, of which 1.7 is spent in re.compile, the rest is almost all BeautifulSoup. It is significant, but I'd explore the possibility of changing the parser first.

Thanks for going through the effort to rewrite it. I'll definitely adopt some things from it, but I'll have to think whether I want it whole. I have some reservations like with logging. If it's a simple command-line app and not a daemon, people probably don't care about timestamps and log levels.

berendt commented 10 years ago

That's okay for me. You don't have to take the whole rewrite. Just drop me a line if you're finished and I'll drop the rewrite from Gist.

I think that's not worth to have caching implemented if there is not really a benefit in the loading time.

berendt commented 8 years ago

Cleaning up my open issues, I will close this issue.

intgr commented 8 years ago

@berendt Thanks, I have opened a new issue #16 "Speed up ruleset parsing" based on this.