samsledje / ConPLex

Adapting protein language models and contrastive learning for highly-accurate drug-target interaction prediction.
http://conplex.csail.mit.edu
MIT License

Formatting of prediction tsv #35

Open E0287979 opened 9 months ago

E0287979 commented 9 months ago

Is there any specific format requirement for prediction tsv? I am able to run the predict function when I use the tsv within the repository.

I get an error when I try to predict on a file I generated using surfaceome cayman as the backbone.

Traceback (most recent call last):
  File "/anaconda3/envs/conplex-dti/bin/conplex-dti", line 6, in <module>
    sys.exit(main())
  File "/ConPLex/conplex_dti/main.py", line 41, in main
    args.main_func(args)
  File "/ConPLex/conplex_dti/cli/predict.py", line 104, in main
    drug_featurizer.preload(query_df["moleculeSmiles"].unique())
  File "/ConPLex/conplex_dti/featurizer/base.py", line 162, in preload
    if seq in h5fi:
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/anaconda3/envs/conplex-dti/lib/python3.9/site-packages/h5py/_hl/group.py", line 514, in __contains__
    return h5g._path_valid(self.id, self._e(name), self._lapl)
  File "/anaconda3/envs/conplex-dti/lib/python3.9/site-packages/h5py/_hl/base.py", line 206, in _e
    raise TypeError(f"A name should be string or bytes, not {type(name)}")
TypeError: A name should be string or bytes, not <class 'float'>

I think the features returned NaN.

samsledje commented 2 months ago

Sorry for the late response on this. ConPLex expects files formatted as in https://github.com/samsledje/ConPLex/blob/main/tests/toy_predict.tsv, that is, a tab-separated file with columns for the protein and molecule identifiers, followed by their descriptions as sequences and SMILES strings. One common error with hand-written tab-separated files is that each "tab" is actually four spaces, which isn't parsed properly. Make sure you're using a real tab character (\t) when creating this file.
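
If it helps, here is a rough way to check a file before running predict. It is only a sketch using the Python standard library, not part of ConPLex: the function name find_tsv_problems is made up for illustration, and the default of four columns (protein ID, molecule ID, protein sequence, SMILES) is an assumption based on the description above, so adjust it if your file differs. It flags rows that do not split into the expected number of tab-separated fields (typical when spaces were used instead of tabs) and rows with an empty field. An empty SMILES cell is one plausible source of the float in the TypeError above, since pandas reads empty cells as NaN, assuming the file is loaded with pandas.

```python
import csv


def find_tsv_problems(path, n_columns=4):
    """Return a list of (line_number, message) for rows that look malformed.

    n_columns=4 is an assumption about the expected layout, not something
    taken from the ConPLex code.
    """
    problems = []
    with open(path, newline="") as handle:
        # QUOTE_NONE: treat every character literally and split only on tabs.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_number, row in enumerate(reader, start=1):
            if len(row) != n_columns:
                # A single field containing runs of spaces usually means
                # the separators are spaces rather than real tabs.
                looks_space_separated = len(row) == 1 and "  " in row[0]
                hint = " (spaces instead of tabs?)" if looks_space_separated else ""
                problems.append(
                    (
                        line_number,
                        f"expected {n_columns} tab-separated fields, "
                        f"found {len(row)}{hint}",
                    )
                )
            elif any(not field.strip() for field in row):
                # Empty cells are typically read as NaN (a float) by pandas.
                problems.append((line_number, "empty field"))
    return problems


# Example usage (the file name is a placeholder):
# for line_number, message in find_tsv_problems("my_queries.tsv"):
#     print(f"line {line_number}: {message}")
```

An empty result only means the layout looks consistent. It does not check that the sequences or SMILES strings themselves are valid.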