hkmztrk / DeepDTA

What does the data mean? #1

Closed Running-z closed 4 years ago

Running-z commented 5 years ago

Hello, I am interested in your work and would like to try it with my own data, but I do not quite understand the files in your data folder. Could you provide a description of the data? For example, what does each txt file under data/davis contain? Could you give me a guide? Thank you.

hkmztrk commented 5 years ago

Hello, sorry for the inconvenience. I'll try to update the readme section as soon as possible.

So actually, in order to be able to run the code, you basically need:

davis_ligands.txt -> contains the sequences of the drugs (SMILES)
davis_proteins.txt -> contains the sequences of the proteins in the data
Y -> contains the binding affinity matrix in pickle form (drugs x proteins, in the same order as in the text files above)

The folds folder contains the folds for the training and test sets. The setting numbers 1-2-3 indicate the type of prediction problem (in the paper we used setting 1).

If you want to use the model with your own data, you need to prepare the inputs and folds beforehand, similar to the examples given here. I will also try to include a helper to make these transformations easier. It might take a few days.

Running-z commented 5 years ago

@hkmztrk Ok, thank you, I am looking forward to your update. I really need to train on my own data, so I can be one of your users and report practical questions back to you.

In addition, what do the values 1, 2, 3 of --problem_type mean? What kind of problem does each one stand for? Finally, when I train with your data and examples, the model seems to keep training forever. I set it to train 100 times, but it seems to enter a loop, and the --checkpoint_path parameter does not seem to save any files. Can you give me some advice?

hkmztrk commented 5 years ago

@Running-z That'll be really cool for me as well if you provide your feedback! I'm currently travelling so it might take a little time to make a really useful update.

But if you are in a hurry, I suggest you prepare your data following the examples. You need to convert your lists of input sequences to JSON lists, and for binding affinity you need to save your interaction values in a pickle (drugs x proteins). These should be enough to run the model on your own data.
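A minimal sketch of that preparation, assuming the sequences are already held in Python lists. The helper `save_dataset`, the file names `my_ligands.txt` and `my_proteins.txt`, and the toy values are hypothetical and only illustrate the "JSON lists plus pickled drugs x proteins matrix" layout described above; check the example files in the repository for the exact names and structure the code expects.

```python
import json
import os
import pickle
import tempfile


def save_dataset(out_dir, smiles, proteins, affinity):
    """Write sequences as JSON lists and the affinity matrix as a pickle.

    Hypothetical helper: the file names are placeholders, not the repo's own.
    """
    # Rows must follow the drug order, columns the protein order.
    if len(affinity) != len(smiles):
        raise ValueError("affinity needs one row per drug")
    if any(len(row) != len(proteins) for row in affinity):
        raise ValueError("affinity needs one column per protein")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "my_ligands.txt"), "w") as f:
        json.dump(smiles, f)
    with open(os.path.join(out_dir, "my_proteins.txt"), "w") as f:
        json.dump(proteins, f)
    # A nested list is pickled here; a NumPy array could be pickled the same way.
    with open(os.path.join(out_dir, "Y"), "wb") as f:
        pickle.dump(affinity, f)
    return (len(smiles), len(proteins))


# Toy data: 2 drugs x 3 proteins (made-up values, for illustration only).
smiles = ["CCO", "c1ccccc1"]
proteins = ["MKTAYIAK", "MSLLTEVE", "MGDVEKGK"]
affinity = [[5.0, 7.2, 6.4], [6.1, 5.0, 8.3]]

out_dir = tempfile.mkdtemp()
shape = save_dataset(out_dir, smiles, proteins, affinity)
```

The shape check is the important part: a matrix whose rows or columns are ordered differently from the sequence files would silently pair affinities with the wrong drug or protein.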

--problem_type indicates which type of dataset folds we're using:
(1) prediction of new interactions between known drugs and known targets,
(2) prediction of known targets for new drugs,
(3) prediction of known drugs for new targets.

DeepDTA as explained in the article uses problem type (1). The other problem types (2, 3) need adjustment (maybe different values for some parameters) because how we look at the data changes.
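One way to picture the three problem types is by what gets held out in each fold: individual drug-target pairs (1), whole drugs (2), or whole targets (3). The sketch below is an illustration under that reading, not the repository's fold-generation code; `unit_of` and `make_folds` are hypothetical names, and how to serialise the result so that it matches the files in the folds folder is not covered here.

```python
import random


def unit_of(pair, setting):
    """Return the thing that is held out as a whole for a given setting."""
    if setting == 1:
        return pair        # a single (drug, protein) interaction
    if setting == 2:
        return pair[0]     # every pair of one drug moves together
    if setting == 3:
        return pair[1]     # every pair of one protein moves together
    raise ValueError("setting must be 1, 2 or 3")


def make_folds(n_drugs, n_proteins, setting, n_folds=5, seed=0):
    """Split all (drug, protein) index pairs into n_folds lists."""
    pairs = [(d, p) for d in range(n_drugs) for p in range(n_proteins)]
    units = sorted({unit_of(pair, setting) for pair in pairs})
    random.Random(seed).shuffle(units)
    # Assign each unit (pair, drug, or protein) to a fold round-robin.
    fold_of = {unit: i % n_folds for i, unit in enumerate(units)}
    folds = [[] for _ in range(n_folds)]
    for pair in pairs:
        folds[fold_of[unit_of(pair, setting)]].append(pair)
    return folds


# Setting 2: no drug appears in more than one fold.
folds_new_drugs = make_folds(n_drugs=10, n_proteins=4, setting=2)
```

With setting 2, a model evaluated on one fold has never seen that fold's drugs during training, which is why the same hyperparameters tuned for setting 1 may not carry over.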

What do you mean by always training? Since it performs 5-fold CV and a parameter search (based on how many values you chose for the number of filters or the filter lengths), it is actually expected to take a lot of time, especially if you are working on a CPU. (I will try to provide the approximate runtimes for the machine I used in the readme.)
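To see why the run can look like an endless loop, it helps to count how many separate trainings a 5-fold CV with a grid search implies. The grid values below are hypothetical, and treating "100" as the number of epochs per training is an assumption about the setting mentioned above.

```python
from itertools import product

n_folds = 5
epochs_per_run = 100                  # assumed meaning of "training for 100 times"

# Hypothetical search grid: two candidate values for each parameter.
num_filters = [16, 32]
smiles_filter_lengths = [4, 8]
protein_filter_lengths = [8, 12]

combinations = list(product(num_filters, smiles_filter_lengths, protein_filter_lengths))
total_runs = n_folds * len(combinations)      # one training per fold per combination
total_epochs = total_runs * epochs_per_run

print(f"{len(combinations)} combinations x {n_folds} folds = {total_runs} trainings")
print(f"{total_epochs} epochs in total")
```

Even this small grid means 40 complete trainings, so seeing the epoch counter restart many times is expected rather than a sign of a loop.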

The --checkpoint argument is one of my to-dos for the next version.

I hope you find this helpful. I'll try to do the updates asap.

Running-z commented 5 years ago

@hkmztrk Ok, I look forward to your update. I can use your method and give you actual feedback; I hope that won't bother you. I just read your paper yesterday, and my suggestion is that you look at the data processing in DeepChem. That deep learning framework uses csv files that store the drugs and their property values. I think representing the data this way may be more user-friendly. Of course, this is only my suggestion. Enjoy yourself.

yuanjun commented 5 years ago

Hello, have you tested your code in a Windows environment? I got a few errors like: No such file or directory: "'data/kiba/'folds/test_fold_setting1.txt" and "TypeError: object of type 'NoneType' has no len()". Thank you.

hkmztrk commented 5 years ago

@yuanjun I've tested it on Linux only, so no. I will try to run it on windows as well. It'll take me a little while. Sorry for the inconvenience.

hkmztrk commented 5 years ago

Hi @Running-z , what type of data input format will be convenient for you to use? I checked DeepChem, and it seems their input form is line by line?

Starlida commented 5 years ago

Hi @hkmztrk, have you done an independent test on other data? What should I do if I want to test my own data?

hkmztrk commented 5 years ago

Hi @Starlida, what format is your data in? If you could tell me, it would help me as well, since I'm currently writing helper code to make the model easier to use on other data.

Starlida commented 5 years ago

I wonder what the results would look like if I train on the Davis data and then test on KIBA?

hkmztrk commented 5 years ago

@Starlida, ok I can make an arrangement for that

Starlida commented 5 years ago

@hkmztrk Thank you so much!

Running-z commented 5 years ago

@hkmztrk I am very sorry that it took me so many days to reply; I hope you can understand! DeepChem provides a good data input format. I also find csv files very convenient in actual use. When a description accompanies the data, the meaning of the data is clear. But at present there is no description of your data format, and to be honest, I don't know how to convert my data into the format you specified. This is my idea! Thank you.

XuanLin1991 commented 4 years ago

Hi @hkmztrk, thank you for sharing your solid work. I tested your code on Linux and I want to train on my own data. How do I generate the test/train_fold_setting.txt files for my data? Thank you for your help.

hkmztrk commented 4 years ago

Hi @jacklin18 thanks a lot for your interest. I might be able to help on this with an update in a week or so. I'll let you know from here.

XuanLin1991 commented 4 years ago

Thank you for your quick reply. In the meantime, I noticed another work named 'WideDTA', which integrates four kinds of input features through X2Vec (e.g., deepsmiles, prot2vec). When will you upload the code, or can you send it to me by email? I think it's an interesting work. Moreover, graph neural networks have become one of the most popular topics in drug discovery. What do you think of them, and is there any follow-up work?

zhouhao-learning commented 4 years ago

@hkmztrk By the way, I would like to know whether your protein sequence refers to an active fragment or to the entire protein sequence.

hkmztrk commented 4 years ago

It's the whole protein sequence @zhouhao-learning

hkmztrk commented 4 years ago

Hi @jacklin18, thanks a lot for your interest. Unfortunately, I cannot share the code for WideDTA right now, since work on it is still ongoing. But if you have questions about the data or the model, I'd be happy to help.

zhouhao-learning commented 4 years ago

@hkmztrk ok,thank you!

XuanLin1991 commented 4 years ago

@hkmztrk Thank you for your reply. I basically understand the principle of WideDTA, but it is difficult for me to implement it myself, especially how to obtain the ligand and motif features. The MCS method is also mentioned in your paper, but I can't find its code on the Internet. Thank you!

hkmztrk commented 4 years ago

@jacklin18 please refer to the WideDTA arXiv paper for how the motifs and MCS are obtained.

Best.