plasticityai / magnitude

A fast, efficient universal vector embedding utility package.
MIT License
1.63k stars 120 forks source link

Magnitude throws DatabaseError when instantiated from a .vec file #8

Closed johnfarina closed 6 years ago

johnfarina commented 6 years ago

Possibly related to #2.

This happens to me when trying to create a Magnitude object from a .vec file, on both linux and macOS, under python 3.6.

Both self.fd and self.path seem to be set correctly, but something breaks in the call to self._db(). Here's an exerpt from a minimal session using one of the pretrained fasttext .vec files

In [3]: Magnitude("wiki-news-300d-1M.vec")
DatabaseError                             Traceback (most recent call last)
<ipython-input-3-d79f82a84c59> in <module>()
----> 1 Magnitude("wiki-news-300d-1M.vec")
/usr/local/lib/python3.6/site-packages/pymagnitude/__init__.py in __init__(self, path, lazy_loading, blocking, use_numpy, case_insensitive, pad_to_length, truncate_left, pad_left, placeholders, ngram_oov, supress_warnings, batch_size, eager, dtype, _namespace, _number_of_values)
    195         # Get metadata about the vectors
    196         self.length = self._db().execute(
--> 197             "SELECT COUNT(key) FROM magnitude") \
    198             .fetchall()[0][0]
    199         self.original_length = self._db().execute(

DatabaseError: file is not a database

Popping into the debugger, self.path points to the temporary .magnitude file, and instantiating a Magnitude object directly from there works just fine.

ipdb> p self.path
'/var/folders/vc/70hjbvsd3pq85t7x693365rc0000gn/T/c620a226ebc2c24c12ee1be5127d1448.magnitude'
ipdb> Magnitude(self.path)
<pymagnitude.Magnitude object at 0x10f2dea58>

So I modified the definition of self._db in __init__.py to just use

conn = sqlite3.connect(self.path,
          check_same_thread=False)

irrespective of the OS, and that seemed to solve the problem. There's likely a better way of doing things though. (I haven't looked through the code very carefully, and I don't totally understand the nuances of /dev/fd/)

AjayP13 commented 6 years ago

Hi John,

Thanks for reporting this. Yep, this is an issue with the self.fd not being properly pointed to the newly converted .magnitude file (it was still pointing to the .vec file). The reason why self.fd is used and not self.path is that by using self.fd we can ensure SQLite doesn't modify the .magnitude file by giving it read only permissions.

Here's the commit and build pipeline that fixes this issue: https://gitlab.com/Plasticity/magnitude/commit/0a182875fa6ca8adf79316559ad761889fa4264e

Once the tests pass on the CI, it will be deployed as version 0.1.16 on PyPI. You can upgrade pymagnitude on your system with pip3 install pymagnitude -U. Please post back here if it resolved your issue or add a :+1: and I'll close out this issue.

johnfarina commented 6 years ago

That commit fixed the bug. Thanks for the quick turnaround!