kexinhuang12345 / DeepPurpose

A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)
https://doi.org/10.1093/bioinformatics/btaa1005
BSD 3-Clause "New" or "Revised" License

Drug property prediction format error #148

Closed. Davegdd closed this issue 1 year ago.

Davegdd commented 1 year ago

Hello and thanks a lot for this wonderful software.

I'm trying to perform drug property prediction using exactly the same code as in Case Study 1(b), but with my own data in the format:

Drug1_SMILES Score/Label
Drug2_SMILES Score/Label
...

and using:

from DeepPurpose import dataset
X_drug, y = dataset.read_file_compound_property(PATH)

but keep getting the following error:

Drug Property Prediction Mode...
in total: 95520 drugs
encoding drug...
unique drugs: 2
descriptastorus not found this smiles: 0 convert to all 0 features
descriptastorus not found this smiles: 1 convert to all 0 features
Done.
Let's use 1 GPU!
--- Data Preparation ---
--- Go for Training ---

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-40-f05f1f26eee9> in <module>
     20                         )
     21 model = models.model_initialize(**config)
---> 22 model.train(train, val, test)
     23 
     24 X_repurpose, drug_name, drug_cid = load_broad_repurposing_hub(SAVE_PATH)

/usr/local/lib/python3.7/dist-packages/DeepPurpose/CompoundPred.py in train(self, train, val, test, verbose)
    341 
    342                                 score = self.model(v_d)
--> 343                                 label = Variable(torch.from_numpy(np.array(label)).float()).to(self.device)
    344 
    345                                 if self.binary:

TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

If I use the format Score SMILES instead of the one above, it does recognize and properly encode all drugs, but I still get the same TypeError: can't convert ... etc.

So I'm not sure whether there is an issue with the format. Thanks for any help.

kexinhuang12345 commented 1 year ago

Can you print X_drug and y to check whether X_drug is the list of SMILES and y is the list of labels?
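
One way to run that check is sketched below. The helper names `looks_numeric` and `summarize` are hypothetical and are not part of DeepPurpose; the sketch only assumes that both returned values are sequences of strings or numbers. A column of SMILES should report almost no entries that parse as numbers, while a column of labels should report all of them.

```python
def looks_numeric(value):
    """Return True if value can be parsed as a float (e.g. '0' or '1.5')."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def summarize(name, values, n=3):
    """Describe a column: its size, how many entries parse as numbers, and a preview."""
    values = list(values)
    numeric = sum(looks_numeric(v) for v in values)
    return (f"{name}: {len(values)} entries, {numeric} parse as numbers, "
            f"first {n}: {values[:n]}")


# Toy stand-ins for the two values returned by the loader (assumed data).
X_drug = ["1", "0", "1"]                # suspicious: these look like labels
y = ["CCO", "c1ccccc1", "CC(=O)O"]      # suspicious: these look like SMILES

print(summarize("X_drug", X_drug))
print(summarize("y", y))
```

If `X_drug` turns out to be mostly numeric and `y` mostly non-numeric, the two columns are probably swapped relative to what the loader expects.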

Davegdd commented 1 year ago

This is the file I'm using: H5N1_dataset_bioarray_small.txt

I got it to work by doing:

y, X_drugs = dataset.read_file_compound_property(PATH)
y = y.astype(float)

So it seems there were two issues: it was extracting x as y and vice versa, and it was reading the labels as strings rather than floats. Not sure if I was doing something wrong.
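
A slightly more defensive version of the same workaround could look like the sketch below. It is only an illustration: `split_smiles_and_labels` is a hypothetical helper, not a DeepPurpose function, and it assumes that every label parses as a number while SMILES strings do not. It decides which of the two returned columns holds the labels and converts them to floats, so the column order in the file no longer matters.

```python
def _is_number(text):
    """Return True if text can be parsed as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def split_smiles_and_labels(col_a, col_b):
    """Return (smiles, labels) regardless of which column came first.

    Labels are converted to float; raises ValueError if the two columns
    cannot be told apart (both numeric, both non-numeric, or both empty).
    """
    col_a, col_b = list(col_a), list(col_b)
    a_numeric = all(_is_number(v) for v in col_a)
    b_numeric = all(_is_number(v) for v in col_b)
    if a_numeric == b_numeric:
        raise ValueError("cannot tell which column holds the labels")
    smiles, labels = (col_b, col_a) if a_numeric else (col_a, col_b)
    return [str(s) for s in smiles], [float(v) for v in labels]


# Toy example with the columns in "Score SMILES" order (assumed data).
smiles, labels = split_smiles_and_labels(["1", "0"], ["CCO", "c1ccccc1"])
print(smiles)
print(labels)
```

The two lists can then be wrapped in NumPy arrays before being passed on, which avoids the string-typed label array that triggered the TypeError.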

kexinhuang12345 commented 1 year ago

Thanks! Does the demo code work on your end? https://github.com/kexinhuang12345/DeepPurpose/blob/master/DEMO/load_data_tutorial.ipynb

Davegdd commented 1 year ago

The data processing demo code seems to work, thanks.