plasticityai / magnitude

A fast, efficient universal vector embedding utility package.
MIT License
1.63k stars 120 forks

Database disk image malformed with multiprocessing #39

Closed jacobzweig closed 5 years ago

jacobzweig commented 5 years ago

Hi @AjayP13, I was curious if you had any examples of how you've used this with multiprocessing previously. I'm bumping into a pysqlite error when I try to run with multiprocessing:

from multiprocessing import Process

import tensorflow as tf

coord = tf.train.Coordinator()
processes = []
for i in range(num_processes):
    # Each worker gets a slice of the data plus the shared Magnitude object
    args = (texts_sliced[i], labels_sliced[i], output_files[i], concatenated_embeddings)
    p = Process(target=_convert_shard, args=args)
    p.start()
    processes.append(p)
coord.join(processes)
  File "/home/jacob/test.py", line 454, in _convert_shard 
    text_embedding = embedding.query(text) 
  File "/home/jacob/anaconda3/pymagnitude/third_party/repoze/lru/__init__.py", line 390, in cached_wrapper                                                                 
    val = func(*args, **kwargs) 
pysqlite2.dbapi2.DatabaseError: database disk image is malformed                                                         
  File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 2088, in query
    for i, m in enumerate(self.magnitudes)]
  File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 2088, in <listcomp>
    for i, m in enumerate(self.magnitudes)] 
  File "/home/jacob/anaconda3/pymagnitude/third_party/repoze/lru/__init__.py", line 390, in cached_wrapper
    val = func(*args, **kwargs)
  File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 1221, in query
    vectors = self._vectors_for_keys_cached(q, normalized)
  File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 1109, in _vectors_for_keys_cached 
    unseen_keys[i], normalized, force=force)
  File "/home/jacob/anaconda3/pymagnitude/third_party/repoze/lru/__init__.py", line 390, in cached_wrapper 
    val = func(*args, **kwargs)
  File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 483, in _out_of_vocab_vector_cached
    return self._out_of_vocab_vector(*args, **kwargs)
  File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 992, in _out_of_vocab_vector, normalized=normalized)
  File "/home/jacob/anaconda3/pymagnitude/__init__.py", line 829, in _db_query_similar_keys_vector, params).fetchall()   
pysqlite2.dbapi2.DatabaseError: database disk image is malformed

I've tried reloading the .Magnitude files as well as setting blocking=True, but can't seem to get around it. Any ideas?

Thanks!

AjayP13 commented 5 years ago

Just a cursory glance, but it looks like you are sharing a single concatenated_embeddings variable by passing it as args to each Process. This is likely the root cause of the issue. I'm not sure how Python shares a variable that isn't serializable across processes, but I would imagine it is a straight byte-for-byte memory copy of the object. I would not share a Magnitude variable between processes like that, since it contains things (like database references) that should not be memory-copied.

Instead, it's better to instantiate Magnitude within each process (i.e. call the Magnitude() constructor in each process and, in your case, also do the concatenation in each process). Magnitude uses memory maps, so even though you are instantiating a Magnitude object in multiple processes, it will avoid duplicating memory where it can, and your application should not take a performance hit.
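
For illustration, here is a minimal sketch of that fix (not code from this thread): the worker body, file names, and toy data below are hypothetical placeholders, and it uses the Magnitude(m1, m2) concatenation constructor from the README.

from multiprocessing import Process

from pymagnitude import Magnitude

def _convert_shard(texts, labels, output_file):
    # Construct (and concatenate) Magnitude inside the worker: each process
    # gets its own SQLite connections, while the vector data itself is
    # memory-mapped and still shared between processes by the OS.
    embeddings = Magnitude(
        Magnitude("word2vec.magnitude"),  # hypothetical paths
        Magnitude("glove.magnitude"),
    )
    with open(output_file, "w") as f:
        for text, label in zip(texts, labels):
            vector = embeddings.query(text)
            f.write("%d\t%s\n" % (label, vector[:3]))

if __name__ == "__main__":
    # Toy shards standing in for texts_sliced / labels_sliced / output_files
    texts_sliced = [["hello world"], ["good morning"]]
    labels_sliced = [[0], [1]]
    output_files = ["shard-0.tsv", "shard-1.tsv"]
    processes = []
    for i in range(len(output_files)):
        p = Process(target=_convert_shard,
                    args=(texts_sliced[i], labels_sliced[i], output_files[i]))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()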

Let me know if that solves the issue, if not I'll keep digging.

jacobzweig commented 5 years ago

Sorry for the delayed reply. That worked!

duhaime commented 3 years ago

@AjayP13 if you have a moment, could I please ask you to say a little bit about why copying an object across multiprocessing processes might cause a sqlite db to become corrupted? Any insights you can offer on this question would be super helpful!

AjayP13 commented 3 years ago

Hi Doug,

Yes, you should not copy it across multiple processes. Copying the object also copies the underlying SQLite connection/cursor, which is not safe: SQLite's documentation explicitly warns against carrying an open connection across a fork(), because the child process ends up sharing the parent's file descriptors, locks, and connection state. Reads through that duplicated state can go wrong in ways that surface exactly as the "database disk image is malformed" error above.

However, Magnitude does support multiple processes. Just instantiate a new Magnitude object in each process. The instances will share memory through memory mapping so as not to duplicate resources.
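
For example, here is a minimal sketch (not from this thread) of one way to do that per-process instantiation with a multiprocessing.Pool initializer, so each worker opens Magnitude exactly once; the file path is a placeholder:

from multiprocessing import Pool

from pymagnitude import Magnitude

_embeddings = None  # one Magnitude instance per worker process

def _init_worker(path):
    # Runs once in each worker process: opens fresh SQLite connections
    # here instead of inheriting copied ones from the parent.
    global _embeddings
    _embeddings = Magnitude(path)

def _embed(text):
    return _embeddings.query(text)

if __name__ == "__main__":
    texts = ["hello world", "good morning"]
    with Pool(processes=2, initializer=_init_worker,
              initargs=("glove.magnitude",)) as pool:  # hypothetical path
        vectors = pool.map(_embed, texts)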


duhaime commented 3 years ago

Amen, thanks very much @AjayP13!