plasticityai / magnitude

A fast, efficient universal vector embedding utility package.
MIT License

Magnitude queries extremely slow for some queries with medium model. #15

Closed · daphnei closed this 6 years ago

daphnei commented 6 years ago

Also, I don't seem to be getting the following advantage described in the documentation: "Moreover, memory maps are cached between runs so even after closing a process, speed improvements are reaped."

See the following log.

$ python
Python 3.4.6 (default, Mar 22 2017, 12:26:13) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pymagnitude import Magnitude
>>> vectors = Magnitude('/nlp/data/embeddings_magnitude/eng/GoogleNews-vectors-negative300.magnitude.medium')
>>> from timeit import timeit
>>> timeit('vectors.query(\'cat\')', 'from __main__ import vectors', number=1)
0.0585936838760972
>>> timeit('vectors.query(\'food\')', 'from __main__ import vectors', number=1)
0.03608247195370495
>>> timeit('vectors.query(\'believe\')', 'from __main__ import vectors', number=1)
0.02389267599210143
>>> timeit('vectors.query(\'denormalization\')', 'from __main__ import vectors', number=1)
27.955912864999846
>>> timeit('vectors.query(\'tariffication\')', 'from __main__ import vectors', number=1)
36.63970931386575
>>> timeit('vectors.query(\'tariffication\')', 'from __main__ import vectors', number=1)
7.962598465383053e-05
>>> exit()
$ python
Python 3.4.6 (default, Mar 22 2017, 12:26:13) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pymagnitude import Magnitude
>>> vectors = Magnitude('/nlp/data/embeddings_magnitude/eng/GoogleNews-vectors-negative300.magnitude.medium')
>>> from timeit import timeit
>>> timeit('vectors.query(\'tariffication\')', 'from __main__ import vectors', number=1)
34.75812460412271
>>>

I understand that some queries (especially OOV ones) should be slower than others, but 36 seconds seems excessive. This issue doesn't affect all out-of-vocabulary words. For example:

>>> timeit('vectors.query(\'catdogcow\')', 'from __main__ import vectors', number=1)
1.1214001160115004
>>> 'catdogcow' in vectors
False

Is there anything I can do to get all queries to run within some reasonable threshold, say 2 seconds, or to get the caching to work? Maybe there should be a feature where, if an OOV query is taking too long, a random vector is returned instead, like the light model does?
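For illustration, here's a rough sketch of the kind of fallback I mean (query_with_fallback is a made-up name, not pymagnitude API; the MD5 seeding is just one arbitrary way to make the random vector deterministic across runs):

import hashlib

import numpy as np

def query_with_fallback(vectors, word):
    # In-vocabulary words go through the normal query path.
    if word in vectors:
        return vectors.query(word)
    # OOV: return a deterministic pseudo-random unit vector, roughly
    # what the light model does for OOV keys. Seeding from a stable
    # hash of the word keeps repeated queries consistent across runs.
    seed = int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16) % (2 ** 32)
    rng = np.random.RandomState(seed)
    vec = rng.uniform(-1.0, 1.0, vectors.dim)
    return vec / np.linalg.norm(vec)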

AjayP13 commented 6 years ago

Thanks for reporting this; I've noticed it too for some queries. I'll see if there's anything I can do to speed these queries up, and if not, I'll add a 2-second timeout like you suggested.
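In the meantime, a caller-side workaround could look something like this sketch (query_with_timeout is not pymagnitude API; it only bounds how long you wait, since the underlying SQLite query can't be cancelled from outside):

from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=1)

def query_with_timeout(vectors, word, timeout=2.0, fallback=None):
    # Run the (possibly slow) query in a worker thread and stop
    # waiting after `timeout` seconds. A stuck query still occupies
    # the worker until SQLite finishes, so this only bounds the wait.
    future = pool.submit(vectors.query, word)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return fallback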

AjayP13 commented 6 years ago

Sorry it took so long. I added something in Release 0.1.56 to control how long these queries take by limiting the search. The timeout was not as easy to implement as I thought, since SQLite has no built-in timeout for queries.

I also saw you mentioned memory-map caches weren't working for you between runs. They likely are. Memory maps are used for kNN search, and they are cached between runs so you don't have to build them each time (otherwise they would take over a minute to build on every run). What you are experiencing is likely some queries being a little slower on a new run, which is normal.

If the first kNN search of every run is extremely slow, then your memory-map caches aren't working. They are cached to $TMPDIR on Linux/Mac. If you are using a Docker container or a VM, you might be losing this temporary directory every time; in that case, you can mount that temp directory as a volume to a folder on your host.
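For example, if the temp directory turns out to be the problem, one way to keep the caches around (a sketch; the path is just an example, and it should be a persistent, volume-mounted folder) is to point TMPDIR somewhere durable before importing pymagnitude:

import os

# Example persistent location (path is illustrative); the point is
# that it must survive container restarts.
os.makedirs('/data/magnitude_tmp', exist_ok=True)
os.environ['TMPDIR'] = '/data/magnitude_tmp'

# Import after TMPDIR is set: Python's tempfile.gettempdir() caches
# its answer on first use, so the ordering matters.
from pymagnitude import Magnitude

vectors = Magnitude('/nlp/data/embeddings_magnitude/eng/GoogleNews-vectors-negative300.magnitude.medium')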