Closed: LostInDarkMath closed this issue 3 years ago.
Thanks for the report. You're right, this is very slow. I've been testing with the English WordNet, where it is fast. So I think the problem is the expanded queries.
The path metric (as well as others) needs to calculate hypernym paths. Following these relations in a lexicon that includes them directly (like the English WordNet) is fast, but when the relations are borrowed from another resource (as the German wordnet does here, borrowing relations from the English WordNet to provide its structure), the calculations slow down considerably: they first need to traverse the Interlingual Index (ILI) to find relations, then traverse it again to get back to target-lexicon synsets.
There are some potential remedies (maybe more than one is necessary).
There is currently no way to set a timeout, but you could probably set one up yourself, maybe with something like this.
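One way to bound the wait yourself is a generic timeout wrapper. This is a sketch, not part of Wn's API: the wrapped call runs in a worker thread, and we stop waiting after a deadline (the slow call itself keeps running in the background until it finishes).

```python
# Hedged sketch (not part of Wn's API): run any blocking function in a
# worker thread and stop waiting after `seconds`, returning `default`.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
import time

def call_with_timeout(fn, *args, seconds=5.0, default=None):
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=seconds)
    except FutureTimeout:
        return default  # we stop waiting; the worker thread finishes on its own
    finally:
        pool.shutdown(wait=False)

# e.g. call_with_timeout(similarity.path, ss1, ss2, seconds=5.0)
print(call_with_timeout(lambda: 42, seconds=1.0))                # 42
print(call_with_timeout(lambda: time.sleep(0.5), seconds=0.1))   # None
```

Note that Python threads cannot be force-killed, so a timed-out computation still burns CPU until it completes; this only bounds how long the caller blocks.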
Looking into this a bit deeper, I think there is actually something buggy. If you have two lexicons with the same relations (e.g., you have both the English WordNet and the Princeton WordNet loaded), then the number of hypernym paths is greatly expanded. For example, below, if you just ask for the hypernym paths with a specific lexicon, there is only one path:
>>> for path in wn.synset('ewn-00023280-n', lexicon='ewn:2020').hypernym_paths():
... print(path)
...
[Synset('ewn-00002137-n'), Synset('ewn-00001740-n')]
But if you specify multiple lexicons or a language shared by multiple lexicons, you get more than expected:
>>> for path in wn.synset('ewn-00023280-n', lang='en').hypernym_paths():
... print(path)
...
[Synset('pwn-00002137-n'), Synset('ewn-00001740-n')]
[Synset('pwn-00002137-n'), Synset('pwn-00001740-n')]
[Synset('ewn-00002137-n'), Synset('ewn-00001740-n')]
[Synset('ewn-00002137-n'), Synset('pwn-00001740-n')]
Notice that some paths have synsets from both lexicons. The number of paths depends on the number of synsets shared by matching lexicons, which can get very large for longer paths.
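To see why this blows up, consider a rough model (an illustration only, not Wn's actual algorithm): if every synset along a hypernym path of length n can be realized by any of its k equivalent synsets across the matching lexicons, the number of distinct paths multiplies out to k**n.

```python
# Illustration only: each node on a hypernym path can be realized by any
# of its equivalent synsets, so the path variants multiply.
def path_variants(equivalents_per_node):
    total = 1
    for k in equivalents_per_node:
        total *= k
    return total

print(path_variants([2, 2]))    # 4 -- the two-node, two-lexicon example above
print(path_variants([2] * 10))  # 1024 -- a depth-10 path with two lexicons
```

With three or more lexicons sharing synsets, or deeper taxonomies, the count grows far faster, which matches the slowdowns reported here.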
So a workaround for now is to explicitly set the expand-set for your wordnet object:
>>> de = wn.Wordnet(lang='de', expand='ewn:2020')
>>> ss1 = de.synsets('Anarchie')[0]
>>> ss2 = de.synsets('Fotografie')[0]
>>> from wn import similarity
>>> similarity.path(ss1, ss2) # this is quick; see below
0.1
>>> import timeit
>>> timeit.timeit('similarity.path(ss1, ss2)', globals=globals(), number=1) # reports the time in seconds
0.10787162600172451
I'll change this issue to a bug. While I still hope to do the enhancement for #38, maybe we can get away with not doing the others.
Thanks for your detailed answers! I forgot to mention that I'm using only the German odenet 1.3. Therefore, the expand keyword does not work for me. Do you have an idea how I can fix this?
The timeout-decorator might be an option, but it would be great if there is a better way.
Only odenet? That might be a different issue, then. I'll look into it. And thanks for your patience; this part isn't thoroughly tested yet.
Yes, I only downloaded odenet.
I've opened #90 for the multi-lexicon issue I detailed above. When you only have OdeNet, it looks like a data problem, so I opened hdaSprachtechnologie/odenet#20. The problem is about the same: things are slow because of a combinatoric explosion of hypernym paths. To be honest, now that I have more information, I'm not sure why my workaround worked above...
I'm going to close this issue because it's not really that path similarity is slow, it's the combinatoric issue with hypernym paths, and see the issues linked above for those.
For me as a user, performance is very important. Surprisingly, I never had any performance problems with WordNet from NLTK. I also used the path similarity there and I made thousands of requests. Are they doing something different?
> For me as a user, performance is very important.
Rest assured, my closing this issue doesn't mean that resolving the problem isn't important. It's just that the other issues more accurately describe the problem(s) involved. I'm working on this right now.
> Surprisingly, I never had any performance problems with WordNet from NLTK. I also used the path similarity there and I made thousands of requests. Are they doing something different?
The NLTK only had the Princeton WordNet (PWN) taxonomic structure, and all the OMW wordnets were just words attached to the PWN synsets (that is, they were built with the "expand" methodology by translating English words). So if you searched, say, the French wordnet (the OMW doesn't have German yet) for "chien", you'd get the exact same synset as searching for "dog" in English. This means that the wordnets of all other languages are limited to the synset structure of PWN (you're out of luck if your language has a concept that's not in English), but it also means that the other languages can benefit from the hypernym/hyponym relations of PWN and, fortunately, the taxonomic structure of PWN is fairly well-behaved. The new OdeNet is not built with the expand methodology, and it's fairly new, so there are still some problems with the data.
Wn, by contrast, treats every wordnet as an independent resource. Each has its own synset structure (even if it was built via the expand methodology), which better enables them to grow in a way that's natural for the language. If you want to borrow relations from the PWN, you can be explicit about which wordnet (and which version) you use to do so. This functionality is new and, as we've seen, a bit tricky to get right.
Finally, the backend architecture of the NLTK and Wn are drastically different. The NLTK reads the wordnet data from text files every time you start it up, and it holds much of that info in memory, while Wn uses a SQLite database. This gives them different performance profiles. For instance, startup and initial queries with Wn are much faster:
$ bench() { /usr/bin/time -f" time\t%es\n mem\t%MK" "$@"; }
$ bench python -c 'import wn; wn.synsets("chien", lang="fr")'
time 0.22s
mem 24084K
$ bench python -c 'from nltk.corpus import wordnet; wordnet.synsets("chien", lang="fra")'
time 2.23s
mem 157920K
But repeated queries (such as finding hypernym paths) can be slower, as the NLTK holds everything in memory. I'm working to improve Wn's performance here.
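In the meantime, repeated pairwise queries can be memoized on the caller's side, keyed by synset ID. This is a sketch under the assumption that the same pairs recur; the placeholder return value stands in for a real similarity.path call, which is not wired up here.

```python
# Hedged sketch: caller-side memoization of repeated similarity queries.
from functools import lru_cache

calls = {"count": 0}  # instrumentation to show the cache working

@lru_cache(maxsize=None)
def cached_path(id1, id2):
    calls["count"] += 1
    # Placeholder: in real use this would look the synsets back up and
    # call wn.similarity.path(...) -- hypothetical wiring, not shown.
    return 0.1

cached_path("ewn-00023280-n", "ewn-00002137-n")
cached_path("ewn-00023280-n", "ewn-00002137-n")
print(calls["count"])  # 1 -- the second call was served from the cache
```

If similarity is symmetric, sorting the two IDs before the lookup would roughly halve the cache size.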
Thank you very much for the detailed answer! Now I understand better how it works :)
@LostInDarkMath in case you're not following the other issues, the bad performance seems to be reasonably resolved in the latest version (0.5.0):
$ python -m timeit -s '
import wn
w = wn.Wordnet(lexicon="odenet")
s1 = w.synsets("Anarchie")[0]
s2 = w.synsets("Fotografie")[0]
from wn import similarity' 'similarity.path(s1, s2)'
2000 loops, best of 5: 193 usec per loop
Thank you! I'll check it out later :+1:
Version 0.5? Odenet was on version 1.3 before. So how can I update?
(venv) D:\Projekte\Github\source\backend_prototyp>python -m wn lexicons
odenet 1.3 [de] Offenes Deutsches WordNet
I tried to remove it with wn.remove('odenet'), cleared the cache .wndata/downloads, and reinstalled it via wn.download(project_or_url='odenet'), but it didn't change anything.
Sorry, I mean v0.5.0 of this library, Wn.
> python -m pip install -U wn
The German data still has some problems, but hopefully Wn won't get stuck as easily as before.
Yes, now it works as desired! Thank you very much! :smiley:
Hi there! I have a problem with wn. The calculation of the path similarity takes extremely long.
Minimal example:
What is the problem here? Does the graph search just take too long? And is there a way to set a timeout value or something like that?
Best, Willi