goodmami / wn

A modern, interlingual wordnet interface for Python
https://wn.readthedocs.io/
MIT License

Path similarity is extremely slow #88

Closed LostInDarkMath closed 3 years ago

LostInDarkMath commented 3 years ago

Hi there! I have a problem with wn. Calculating the path similarity takes extremely long.

Minimal example:

import wn
from wn import similarity

wn_instance = wn.Wordnet(lang='de')

sets_1 = wn_instance.synsets('Anarchie')
sets_2 = wn_instance.synsets('Fotografie')
print(sets_1)
print(sets_2)
w1 = sets_1[0]
w2 = sets_2[0]
print(similarity.path(w1, w2))  # this is extremely slow

What is the problem here? Does the graph search just take too long? And is there a way to set a timeout value or something like that?

Best, Willi

goodmami commented 3 years ago

Thanks for the report. You're right, this is very slow. I've been testing with the English WordNet, where it is fast. So I think the problem is the expanded queries.

The path metric (as well as others) needs to calculate hypernym paths. Following these relations in a lexicon that includes them directly (like the English WordNet) is fast, but when the relations are borrowed from another resource (as the German wordnet does here, borrowing relations from the English WordNet to provide its structure), the calculations slow down a lot: they first need to traverse the Interlingual Index (ILI) to find the relations, then traverse back again to get target-lexicon synsets.
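Schematically, the expanded lookup is a two-hop traversal, something like the following (a purely illustrative sketch, not Wn's actual internals; the dictionaries stand in for database queries and every identifier is a made-up placeholder):

# Illustrative sketch of the two-hop expanded lookup; the dicts stand in
# for database queries and every id here is a made-up placeholder.
synset_to_ili = {'de-1-n': 'ili-1'}        # local synset -> ILI
ili_to_expand = {'ili-1': 'en-1-n'}        # ILI -> expand-lexicon synset
expand_hypernyms = {'en-1-n': ['en-2-n']}  # relation in the expand lexicon
expand_to_ili = {'en-2-n': 'ili-2'}        # expand synset -> ILI
ili_to_local = {'ili-2': 'de-2-n'}         # ILI -> local synset (if any)

def expanded_hypernyms(local_id):
    src = ili_to_expand[synset_to_ili[local_id]]   # hop out via the ILI
    targets = expand_hypernyms.get(src, [])        # follow the relation there
    return [ili_to_local.get(expand_to_ili[t], t)  # hop back via the ILI
            for t in targets]

print(expanded_hypernyms('de-1-n'))  # ['de-2-n']

Every such lookup needs two extra joins compared to a lexicon that stores its own relations, which is where the slowdown comes from.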

There are some potential remedies (maybe more than one is necessary):

There is currently no way to set a timeout, but you could probably set one up yourself, maybe with something like this.
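One rough option, using only the standard library (a sketch; note that a timed-out computation may keep running in a background thread, since Python has no clean way to kill it):

# Sketch: run similarity.path() in a worker thread and give up after a
# deadline; the abandoned computation keeps running in the background.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

from wn import similarity

def path_with_timeout(s1, s2, seconds=5.0):
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(similarity.path, s1, s2)
    try:
        return future.result(timeout=seconds)
    except TimeoutError:
        return None  # or raise, as the caller prefers
    finally:
        executor.shutdown(wait=False)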

goodmami commented 3 years ago

Looking into this a bit deeper, I think there is actually a bug. If you have two lexicons with the same relations (e.g., you have both the English WordNet and the Princeton WordNet loaded), then the number of hypernym paths is greatly expanded. For example, if you just ask for the hypernym paths with a specific lexicon, there is only one path:

>>> for path in wn.synset('ewn-00023280-n', lexicon='ewn:2020').hypernym_paths():
...   print(path)
... 
[Synset('ewn-00002137-n'), Synset('ewn-00001740-n')]

But if you specify multiple lexicons or a language shared by multiple lexicons, you get more than expected:

>>> for path in wn.synset('ewn-00023280-n', lang='en').hypernym_paths():
...   print(path)
... 
[Synset('pwn-00002137-n'), Synset('ewn-00001740-n')]
[Synset('pwn-00002137-n'), Synset('pwn-00001740-n')]
[Synset('ewn-00002137-n'), Synset('ewn-00001740-n')]
[Synset('ewn-00002137-n'), Synset('pwn-00001740-n')]

Notice that some paths have synsets from both lexicons. The number of paths depends on the number of synsets shared by matching lexicons: roughly, if each synset on a path of length n is shared by k lexicons, up to k^n combinations are possible, which gets very large for longer paths.

So a workaround for now is to explicitly set the expand-set for your wordnet object:

>>> de = wn.Wordnet(lang='de', expand='ewn:2020')
>>> ss1 = de.synsets('Anarchie')[0]
>>> ss2 = de.synsets('Fotografie')[0]
>>> from wn import similarity
>>> similarity.path(ss1, ss2)  # this is quick; see below
0.1
>>> import timeit
>>> timeit.timeit('similarity.path(ss1, ss2)', globals=globals(), number=1)  # reports the time in seconds
0.10787162600172451

I'll change this issue to a bug. While I still hope to do the enhancement for #38, maybe we can get away with not doing the others.

LostInDarkMath commented 3 years ago

Thanks for your detailed answers! I forgot to mention that I'm only using the German OdeNet 1.3. Therefore, the expand keyword does not work for me. Do you have an idea how I can fix this?

The timeout-decorator might be an option, but it would be great if there is a better way.

goodmami commented 3 years ago

Only OdeNet? That might be a different issue, then. I'll look into it. And thanks for your patience; this part isn't thoroughly tested yet.

LostInDarkMath commented 3 years ago

Yes, I only downloaded odenet.

goodmami commented 3 years ago

I've opened #90 for the multi-lexicon issue I detailed above. When you only have OdeNet, it looks like a data problem, so I opened hdaSprachtechnologie/odenet#20. The problem is about the same: things are slow because of a combinatoric explosion of hypernym paths. To be honest, now that I have more information, I'm not sure why my workaround worked above...

I'm going to close this issue because the root cause isn't really that path similarity is slow; it's the combinatoric issue with hypernym paths. See the issues linked above for those.

LostInDarkMath commented 3 years ago

For me as a user, performance is very important. Surprisingly, I never had any performance problems with WordNet from NLTK. I also used the path similarity there and I made thousands of requests. Are they doing something different?

goodmami commented 3 years ago

> For me as a user, performance is very important.

Rest assured, my closing this issue doesn't mean that resolving it isn't important. It's just that the other issues describe the underlying problem(s) more accurately. I'm working on this right now.

> Surprisingly, I never had any performance problems with WordNet from NLTK. I also used the path similarity there and I made thousands of requests. Are they doing something different?

The NLTK only had the Princeton WordNet (PWN) taxonomic structure, and all the OMW wordnets were just words attached to the PWN synsets (that is, they were built with the "expand" methodology by translating English words). So if you searched, say, the French wordnet (the OMW doesn't have German yet) for "chien", you'd get the exact same synset as searching for "dog" in English. This means that the wordnets of all other languages are limited to the synset structure of PWN (you're out of luck if your language has a concept that's not in English), but it also means that the other languages can benefit from the hypernym/hyponym relations of PWN, and, fortunately, the taxonomic structure of PWN is fairly well-behaved.

OdeNet is not built with the expand methodology, and it's fairly new, so there are still some problems with the data.
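To illustrate the point about "chien" above with the NLTK (a quick sketch; assumes the WordNet and OMW corpora have been downloaded):

# In the NLTK's OMW data, non-English lemmas attach directly to PWN
# synsets, so a French lookup returns the same Synset objects as an
# English one. Assumes nltk.download('wordnet') and nltk.download('omw-1.4')
# (or 'omw' on older NLTK versions) have been run.
from nltk.corpus import wordnet

fr = set(wordnet.synsets('chien', lang='fra'))
en = set(wordnet.synsets('dog'))
print(fr & en)  # non-empty: e.g. Synset('dog.n.01') appears in both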

Wn, by contrast, treats every wordnet as an independent resource. Each has its own synset structure (even if it was built via the expand methodology), which better enables them to grow in a way that's natural for the language. If you want to borrow relations from the PWN, you can be explicit about which wordnet (and which version) you use to do so. This functionality is new and, as we've seen, a bit tricky to get right.

Finally, the backend architectures of the NLTK and Wn are drastically different. The NLTK reads the wordnet data from text files every time you start it up and holds much of that info in memory, while Wn uses a SQLite database. This gives them different performance profiles. For instance, startup and initial queries with Wn are much faster:

$ alias bench='/usr/bin/time -f" time\t%es\n mem\t%MK"'
$ bench python -c 'import wn; wn.synsets("chien", lang="fr")'
 time   0.22s
 mem    24084K
$ bench python -c 'from nltk.corpus import wordnet; wordnet.synsets("chien", lang="fra")'
 time   2.23s
 mem    157920K

But repeated queries (such as finding hypernym paths) can be slower, as the NLTK holds everything in memory. I'm working to improve Wn's performance here.
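In the meantime, if you make thousands of repeated queries over the same pairs, one user-side workaround is to cache results keyed on synset ids (a sketch, not part of Wn's API):

# Sketch: memoize path similarity by synset id so repeated queries
# don't redo the graph search; assumes a Wordnet object as in the
# examples above.
import functools

import wn
from wn import similarity

w = wn.Wordnet(lexicon='odenet')

@functools.lru_cache(maxsize=None)
def cached_path(id1, id2):
    return similarity.path(w.synset(id1), w.synset(id2))

s1 = w.synsets('Anarchie')[0]
s2 = w.synsets('Fotografie')[0]
print(cached_path(s1.id, s2.id))  # computed once, then served from cache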

LostInDarkMath commented 3 years ago

Thank you very much for the detailed answer! Now I understand better how it works :)

goodmami commented 3 years ago

@LostInDarkMath in case you're not following the other issues, the bad performance seems to be reasonably resolved in the latest version (0.5.0):

$ python -m timeit  -s '
import wn
w = wn.Wordnet(lexicon="odenet")
s1 = w.synsets("Anarchie")[0]
s2 = w.synsets("Fotografie")[0]
from wn import similarity' 'similarity.path(s1, s2)'       
2000 loops, best of 5: 193 usec per loop

LostInDarkMath commented 3 years ago

Thank you! I'll check it out later :+1:

LostInDarkMath commented 3 years ago

Version 0.5? OdeNet was at version 1.3 before. So how can I update?

(venv) D:\Projekte\Github\source\backend_prototyp>python -m wn lexicons
odenet  1.3     [de]    Offenes Deutsches WordNet

I tried to remove it with wn.remove('odenet'), cleared the cache .wndata/downloads, and reinstalled it via wn.download(project_or_url='odenet'), but it didn't change anything.

goodmami commented 3 years ago

Sorry, I meant v0.5.0 of this library, Wn.

> python -m pip install -U wn

The German data still has some problems, but hopefully Wn won't get stuck as easily as before.

LostInDarkMath commented 3 years ago

Yes, now it works as desired! Thank you very much! :smiley: