snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

1.9.1 release of Python package #107

Closed mitya57 closed 5 years ago

mitya57 commented 5 years ago

I think #105 is an important fix, so it deserves a new release.

I tried to list all changes since 1.9.0 in a changelog, so this fixes #96. Please let me know if I missed something important.

This is not yet uploaded to PyPI, I am awaiting your review first.

ojwb commented 5 years ago

BTW, I'm a bit dubious about _clear_cache.

I'm not actually entirely sold that a caching layer belongs here. The generated Python stemmers are much slower (except under PyPy) than those for any other currently supported programming language, so some sort of caching probably is needed at present for applications where stemming performance matters, and having every application implement caching for itself does seem sub-optimal. But the caching layer also hurts some uses - for example, running the test suite is slower because of it: make check_python_irish is ~3.3% faster with the caching removed.
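For reference, a minimal sketch of how an application could implement its own bounded caching with just the standard library (the slow_stem() stand-in is hypothetical; a real application would wrap a generated stemmer's stemWord method):

```python
from functools import lru_cache

# Placeholder for the real (slow) stemming work; purely illustrative.
def slow_stem(word):
    return word.rstrip("s")

# A bounded LRU cache: least-recently-used entries are evicted automatically,
# so there is no separate culling step to size or tune.
@lru_cache(maxsize=10000)
def stem(word):
    return slow_stem(word)

print(stem("cats"))   # first call computes and caches
print(stem("cats"))   # second call is a cache hit
```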

The generated code could be improved - for example, there's quite a lot of needless exception handling such as:

                    try: 
                        if not self.in_grouping(IrishStemmer.g_v, 97, 250):
                            raise lab2()
                        raise lab1()
                    except lab2: pass

which could be replaced by:

                    if self.in_grouping(IrishStemmer.g_v, 97, 250):
                        raise lab1()

Stripping out just the ones in the top-level function for the Irish stemmer yields about a 1% speed-up.

The calculation of how many entries to remove seems a bit odd - it appears to cull 80% of the threshold, but that might not actually bring us down under the threshold (consider if stemWords() is called with a million unique words - it'll stem and cache each in turn, then cull just 8000 of them before returning).
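A toy illustration of that arithmetic (the dict-based cache, maxCacheSize, and the 80% factor are assumptions modelling the behaviour described, not the actual implementation):

```python
# Toy model of the culling behaviour described above (names are assumed).
max_cache_size = 10000
cache = {f"word{i}": f"stem{i}" for i in range(1_000_000)}  # a million unique words

# Removing 80% of the *threshold* rather than enough entries to get below it:
to_remove = int(max_cache_size * 0.8)            # 8000 entries
for key in list(cache)[:to_remove]:
    del cache[key]

print(len(cache))  # 992000 - still nearly 100x over max_cache_size
```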

It also seems to perform a sort of all the current entries to decide which to cull, which looks like O(n*log(n)) where n is maxCacheSize (which could reasonably be set almost arbitrarily large), and requires a lot of scratch space. In C++, std::partition instead of sorting would give O(n) - I'm not sure there's a good way to achieve that in Python (the partition algorithm could be implemented in Python code, but that would probably be slow compared to the sorting code, which is implemented in C in the Python interpreter).
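Short of a full partition, the standard library does offer heapq.nsmallest, which selects the k smallest items in O(n log k) time with only O(k) scratch space - a sketch, assuming each cache entry carries some "last used" ordinal (the cache layout here is hypothetical):

```python
import heapq

# Hypothetical cache mapping word -> (last_used, stem); the ordinal is assumed.
cache = {f"word{i}": ((i * 7919) % 100_000, f"stem{i}") for i in range(100_000)}

cull = 8000
# O(n log k) selection of the k least-recently-used entries; the heap never
# holds more than `cull` items, unlike sorting a full copy of the cache.
victims = heapq.nsmallest(cull, cache.items(), key=lambda kv: kv[1][0])
for key, _ in victims:
    del cache[key]

print(len(cache))  # 92000
```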

ojwb commented 5 years ago

#110 eliminates one case of redundant exception handling, and speeds up make check_python by ~2%.

mitya57 commented 5 years ago

I agree - these except constructions are very suboptimal. However, as these constructions were designed for C-like goto calls, it is not easy to translate them one-to-one to Python. I have no knowledge of the Snowball code, so I'm afraid I won't be able to help here :(

Anyway, as you merged this branch and the changelog has today's date, I pushed the current master (6169ef2456afeab15d53748627c139d7d022149e) to PyPI as 1.9.1.

With regards to std::partition, I don't know of any similar built-in in Python (the closest are the built-in filter and itertools.filterfalse, but those produce new sequences rather than reordering in place), but it should be easy to implement.
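For completeness, a Lomuto-style partition takes only a few lines of pure Python (a sketch, not part of any package discussed here):

```python
def partition(items, pred):
    """Reorder items in place so those satisfying pred come first.
    Returns the index of the first non-satisfying element. O(n) time,
    O(1) extra space - the same contract as C++ std::partition."""
    i = 0
    for j in range(len(items)):
        if pred(items[j]):
            items[i], items[j] = items[j], items[i]
            i += 1
    return i

data = [5, 1, 8, 3, 9, 2, 7]
split = partition(data, lambda x: x < 5)
assert all(x < 5 for x in data[:split])
assert all(x >= 5 for x in data[split:])
```

As noted above, though, a Python-level loop like this may well still lose in practice to CPython's sort, which runs in C.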

Finally, I think users who care about performance should probably use PyStemmer instead.

ojwb commented 5 years ago

these constructions were designed for C-like goto calls

I'm not sure they really were - the signals mechanism comes from SNOBOL which predates C by a number of years.

But it certainly is true that they don't map into Python as easily as into languages with either goto or a labelled break.

There was one case we were generating needlessly complex code for which I've fixed in 067250d52568a7a5ccba26d77254c798e3a3a6a7.

Finally, I think users who care about performance should probably use PyStemmer instead.

Yes, that's a good point - if performance matters much at all that's a much better choice. So perhaps we should just retire the caching here.