reymond-group / tmap

A very fast visualization library for large, high-dimensional data sets.
http://tmap.gdb.tools

Enable pickling of tmap.VectorUint #8

Open · rajarshi opened 4 years ago

rajarshi commented 4 years ago

Hi, I'm trying to compute MHFPs in parallel and then index them. The code looks like this:

import multiprocessing
import pickle

import tmap
from joblib import Parallel, delayed
from rdkit.Chem import AllChem

# enc (an MHFP encoder) and molcsv (an iterable of (smiles, molid) pairs) are set up elsewhere.
def fp_function(pair):
    smi, molid = pair
    mol = AllChem.MolFromSmiles(smi)
    fp = tmap.VectorUint(enc.encode_mol(mol, min_radius=0))
    return molid, fp

num_cores = multiprocessing.cpu_count()
fps = Parallel(n_jobs=num_cores)(delayed(fp_function)(input_pair) for input_pair in molcsv)
pickle.dump(fps, open("libcomp_fps.pkl", 'wb'))

However, on running this I get:

Traceback (most recent call last):
  File "/Users/guha/src/tmap/lsh_query.py", line 27, in <module>
    fps = Parallel(n_jobs=num_cores)(delayed(fp_function)(input_pair) for input_pair in molcsv)
  File "/anaconda3/envs/rdkit/lib/python3.6/site-packages/joblib/parallel.py", line 996, in __call__
    self.retrieve()
  File "/anaconda3/envs/rdkit/lib/python3.6/site-packages/joblib/parallel.py", line 899, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/anaconda3/envs/rdkit/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 517, in wrap_future_result
    return future.result(timeout=timeout)
  File "/anaconda3/envs/rdkit/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/anaconda3/envs/rdkit/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
TypeError: can't pickle tmap.VectorUint objects

Are there plans to make VectorUint pickleable? Or is there an alternative approach to parallelizing this type of computation?
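
One workaround I can think of is to have the workers return plain Python lists of hash values, which pickle fine, and to build the tmap.VectorUint objects only in the parent process after the parallel step. A rough, untested sketch, reusing the imports and the enc/molcsv setup from the snippet above:

def fp_function(pair):
    smi, molid = pair
    mol = AllChem.MolFromSmiles(smi)
    # Return a plain list of hashes; joblib can pickle these across processes.
    return molid, list(enc.encode_mol(mol, min_radius=0))

results = Parallel(n_jobs=num_cores)(delayed(fp_function)(p) for p in molcsv)
# Build the VectorUint objects only here, in the parent process.
fps = [(molid, tmap.VectorUint(hashes)) for molid, hashes in results]
# The plain hash lists can still be pickled for later reuse.
pickle.dump(results, open("libcomp_fps.pkl", 'wb'))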

aparente-nurix commented 4 years ago

I had asked @daenuprobst this question a while back (not on GitHub, since I can't seem to find it here), and the answer was to write them to a text file. However, I've found that writing/reading the fingerprints from text files can be quite slow. Would JSON be a better option? I have not tried this.
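
Something along these lines is what I had in mind (untested): dump the raw hash values as plain integer lists to JSON and rebuild the VectorUint objects on load. Here fps is assumed to be a list of (molid, tmap.VectorUint) pairs, and fps.json is just a placeholder filename:

import json
import tmap

# Write: convert each VectorUint to a plain list of ints so it is JSON-serializable.
with open("fps.json", "w") as f:
    json.dump([(molid, list(fp)) for molid, fp in fps], f)

# Read: rebuild the VectorUint objects from the stored integer lists.
with open("fps.json") as f:
    fps = [(molid, tmap.VectorUint(hashes)) for molid, hashes in json.load(f)]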

rajarshi commented 4 years ago

One possibility is to look at supporting the chemfp binary format, which allows for impressively high-speed I/O.