simonfqy / PADME

This is the repository containing the source code for my Master's thesis research on predicting drug-target interactions using deep learning.
MIT License

Can't I use my own data for training? #6

Closed: Running-z closed this issue 6 years ago

Running-z commented 6 years ago

I want to train a tf_regression model with my own data. My data has 12628 molecules and 1 protein sequence. I changed the attribute fields in my data to match the davis data you provided; it looks like this:

[screenshots: jak2 data preview]

Then I ran drive4_d_warm.sh, but I got the following error:

Traceback (most recent call last):
  File "driver.py", line 696, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 278, in run_analysis
    prediction_file=csv_out)
  File "/project/git2/PADME/dcCustom/molnet/run_benchmark_models.py", line 217, in model_regression
    no_r2=no_r2)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 178, in fit
    max_checkpoints_to_keep, checkpoint_interval, restore, submodel)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 378, in fit_generator
    for feed_dict in self._create_feed_dicts(feed_dict_generator, True):
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 1107, in _create_feed_dicts
    for d in generator:
  File "/project/git2/PADME/dcCustom/models/tensorgraph/fcnet.py", line 331, in default_generator
    pad_batches=pad_batches):
  File "/project/git2/PADME/dcCustom/data/datasets.py", line 758, in iterate
    next_shard = pool.apply_async(dataset.get_shard, (shard_perm[0],))
IndexError: index 0 is out of bounds for axis 0 with size 0

Why is this happening? Is there a problem with my data? During feature extraction, all of my data was removed. Why is that, and what should I do?

simonfqy commented 6 years ago

Again, try to add a phosphorylation field right before A, exactly following my prot_desc.csv format. Ask me if you still have any problems after that.
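For example, something along these lines (a rough pandas sketch; the exact column name, position, and default value are my assumptions, so check them against prot_desc.csv first):

import pandas as pd

df = pd.read_csv("davis_data/my_prot_desc.csv")    # hypothetical file name
df.insert(1, "phosphorylated", 0)                  # assumed column name, position, and default
df.to_csv("davis_data/my_prot_desc.csv", index=False)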

And note that --dataset davis parameter should be preserved. It would invoke the load_davis() function defined in /molnet/load_function/davis_dataset.py, which would in turn use the davis_data/restructured.csv file to generate the data for further processing. You can (and perhaps should) customize the load functions like load_davis().
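If you do reuse load_davis() with your own file, a quick header comparison like this sketch can catch format mismatches before a failed run (my_restructured.csv is a hypothetical file name):

import pandas as pd

# Read only the headers of the reference file and of your own file.
ref = pd.read_csv("davis_data/restructured.csv", nrows=0)
mine = pd.read_csv("davis_data/my_restructured.csv", nrows=0)
missing = set(ref.columns) - set(mine.columns)
print("columns missing from my file:", missing or "none")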

Running-z commented 6 years ago

@simonfqy My data has been processed into the same format as the davis data:

[screenshots: mer data preview]

But I still can't train; I got the following error:

Traceback (most recent call last):
  File "driver.py", line 699, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 278, in run_analysis
    prediction_file=csv_out)
  File "/project/git2/PADME/dcCustom/molnet/run_benchmark_models.py", line 217, in model_regression
    no_r2=no_r2)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 178, in fit
    max_checkpoints_to_keep, checkpoint_interval, restore, submodel)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 378, in fit_generator
    for feed_dict in self._create_feed_dicts(feed_dict_generator, True):
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 1107, in _create_feed_dicts
    for d in generator:
  File "/project/git2/PADME/dcCustom/models/tensorgraph/fcnet.py", line 331, in default_generator
    pad_batches=pad_batches):
  File "/project/git2/PADME/dcCustom/data/datasets.py", line 758, in iterate
    next_shard = pool.apply_async(dataset.get_shard, (shard_perm[0],))
IndexError: index 0 is out of bounds for axis 0 with size 0

This is the content of the bash file I am running:

CUDA_VISIBLE_DEVICES=0
spec='python driver.py --dataset davis \
--model tf_regression --prot_desc_path davis_data/Mer_psc2_Phosphorylated=0.csv \
--model_dir ./model_dir/model_dir4_davis_w --filter_threshold 1 \
--arithmetic_mean --aggregate toxcast \
--intermediate_file ./interm_files/intermediate_cv_warm_3.csv '
eval $spec

simonfqy commented 6 years ago

@Running-z Have you already stored the trained model in ./model_dir/model_dir4_davis_w? Besides, when you're predicting, please remove the parameter --filter_threshold 1, because that parameter is only used for training and validation, not for predicting.

Running-z commented 6 years ago

@simonfqy No, I am not predicting; I am training with my own data. I got this error during training, not prediction.

simonfqy commented 6 years ago

Hi, I guess the reason is that you used the parameter --filter_threshold 1. This parameter causes the program to remove any entity (compound or protein) that appears only once in the whole dataset. Since you have only 1 protein, every one of your compounds appears exactly once, so they are all removed, leaving the dataset empty.

Please remove this parameter and try again.
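In miniature, the filtering behaves like this (a toy sketch, not the actual PADME code):

import pandas as pd

# Toy dataset: 3 compounds, all paired with the same single protein.
df = pd.DataFrame({"smiles": ["C", "CC", "CCC"], "protein": ["jak2"] * 3})
threshold = 1
for col in ("smiles", "protein"):
    # Keep only rows whose entity in this column occurs more than `threshold` times.
    counts = df[col].map(df[col].value_counts())
    df = df[counts > threshold]
print(len(df))  # 0: every compound occurs only once, so the whole table is filtered away

An empty dataset has zero shards, which is exactly the IndexError on shard_perm[0] in your traceback.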

Running-z commented 6 years ago

@simonfqy I tried your suggestion again. I removed the --filter_threshold 1 parameter, but then I can only use --split random; otherwise I get the following error:

Traceback (most recent call last):
  File "driver.py", line 699, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 172, in run_analysis
    filter_threshold=filter_threshold)
  File "/project/git2/PADME/dcCustom/molnet/load_function/davis_dataset.py", line 110, in load_davis
    fold_datasets = splitter.k_fold_split(dataset, K)
  File "/project/git2/PADME/dcCustom/splits/splitters.py", line 121, in k_fold_split
    frac_test=0)
  File "/project/git2/PADME/dcCustom/splits/splitters.py", line 844, in split
    assert self.threshold > 0
AssertionError

Then I cloned your latest code and ran the drive4_d_warm.sh file, and got the following error:

Traceback (most recent call last):
  File "driver.py", line 716, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 175, in run_analysis
    filter_threshold=filter_threshold, oversampled=oversampled)
  File "/project/git3/PADME/dcCustom/molnet/load_function/davis_dataset.py", line 109, in load_davis
    fold_datasets = splitter.k_fold_split(dataset, K)
  File "/project/git3/PADME/dcCustom/splits/splitters.py", line 122, in k_fold_split
    frac_test=0)
  File "/project/git3/PADME/dcCustom/splits/splitters.py", line 917, in split
    assert len(entry_to_write) == 1
AssertionError

The content of drive4_d_warm.sh is:

CUDA_VISIBLE_DEVICES=5
spec='python driver.py --dataset davis --cross_validation 
--model tf_regression --prot_desc_path davis_data/prot_desc.csv 
--model_dir ./model_dir/model_dir4_davis_w --filter_threshold 1 
--arithmetic_mean --aggregate toxcast --split_warm 
--intermediate_file ./interm_files/intermediate_cv_warm_3.csv '
eval $spec

Then I removed --filter_threshold 1 from drive4_d_warm.sh and got the following error:

Traceback (most recent call last):
  File "driver.py", line 716, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 175, in run_analysis
    filter_threshold=filter_threshold, oversampled=oversampled)
  File "/project/git3/PADME/dcCustom/molnet/load_function/davis_dataset.py", line 109, in load_davis
    fold_datasets = splitter.k_fold_split(dataset, K)
  File "/project/git3/PADME/dcCustom/splits/splitters.py", line 122, in k_fold_split
    frac_test=0)
  File "/project/git3/PADME/dcCustom/splits/splitters.py", line 865, in split
    remain_this_mol_entries = mol_entries[molecule] - removed_entries
UnboundLocalError: local variable 'removed_entries' referenced before assignment

My data format is the same as the davis data:

[screenshots: data preview]

Excuse me for raising so many problems, but I am genuinely confused, and I hope you can guide me.

simonfqy commented 6 years ago

@Running-z Hi, I assume you have only one protein in your interaction file, is that right? If so, split_warm cannot work: every compound appears only once in the whole dataset, so there is no way to ensure that every compound appears in at least two cross-validation folds (which is what split_warm requires). If that is the case, simply remove the split_warm parameter in addition to the filter_threshold 1 parameter.
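Before enabling split_warm on your own data, you can check that precondition with a quick sketch like this (my_interactions.csv and its smiles column are placeholders for your own file):

import pandas as pd

df = pd.read_csv("my_interactions.csv")        # hypothetical interaction table
counts = df["smiles"].value_counts()           # occurrences of each compound
print((counts == 1).sum(), "compounds appear only once")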

Running-z commented 6 years ago

@simonfqy In fact, I have 8 proteins, but not every molecule interacts with all 8; the molecules of each target protein interact only with that protein, so there may be cases where a molecule appears only once. Following your advice, I removed the --filter_threshold 1 and --split_warm parameters and reduced the run to a single training session, but I still got an unexpected error:

Traceback (most recent call last):
  File "driver.py", line 696, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "driver.py", line 378, in run_analysis
    train_score = train_scores_list[h]
IndexError: list index out of range

However, if I remove the --cross_validation parameter, I am able to complete the training.

simonfqy commented 6 years ago

@Running-z Glad to know that you were able to complete the training process. I don't know why there is a list index out of range error when --cross_validation is enabled. Please refer to this line of code: https://github.com/simonfqy/PADME/blob/5e97ba97f1389ea975b196a31b3464ca2cd00512/driver.py#L288

Previously I made an error on that line: I used range(1, fold_num) instead of range(fold_num). It was originally a hard-coded temporary hack for running cross-validation folds in separate sessions, but I forgot to revert it after that task was done and accidentally pushed it to GitHub. If your copy of the code uses range(1, fold_num), that is very likely the source of the problem. Many apologies.

According to the shell script you shared, at least part of your cross-validation results were saved in ./interm_files/intermediate_cv_warm_3.csv. I save them automatically after each fold so that a system crash cannot destroy all your work before it is written to the result files.
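In miniature, the difference is just this (fold_num = 5 is an arbitrary example):

fold_num = 5
print(list(range(1, fold_num)))  # [1, 2, 3, 4] -- fold 0 is skipped, so only 4 scores are collected
print(list(range(fold_num)))     # [0, 1, 2, 3, 4] -- the corrected loop visits every fold

With the buggy range, train_scores_list ends up one element short of the fold indices used later, which matches the IndexError you saw.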

Running-z commented 6 years ago

@simonfqy Ok, I checked the code; the version I used does indeed have for h in range(1, fold_num). I will fix it and keep trying. Thank you.

simonfqy commented 6 years ago

Seems to be solved. Closed for now.