sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License

Supervised learning train error: need at least one array to concatenate #4

Closed hejing3283 closed 5 years ago

hejing3283 commented 5 years ago

I am running a test using my own data. After loading the data successfully, I got an error when training:

Load Data from directories

DTCR_WF.Get_Data(directory='data_test/', Load_Prev_Data=False, aggregate_by_aa=True,
                 aa_column_beta=1, v_beta_column=3, d_beta_column=4, j_beta_column=5,
                 count_column=6, n_jobs=2, sep=",")
DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
DTCR_WF.Train()

error msg start
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in
      1 DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
----> 2 DTCR_WF.Train()
      3
      4 # DTCR_WF.Monte_Carlo_CrossVal(folds=5,test_size=0.3,stop_criterion=0.25,epochs_min=100,
      5 #                              suppress_output = False)

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
   3148
   3149             valid_loss, valid_accuracy, valid_predicted, valid_auc = \
-> 3150                 Run_Graph_WF(self.valid, sess, self, GO, batch_size, random=False, train=False)
   3151
   3152

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, random, train, drop_out_rate)
    390     loss = np.mean(loss)
    391     accuracy = np.mean(accuracy)
--> 392     predicted_out = np.vstack(predicted_list)
    393     try:
    394         auc = roc_auc_score(set[-1], predicted_out)

~/anaconda3/envs/dl/lib/python3.7/site-packages/numpy/core/shape_base.py in vstack(tup)
    281     """
    282     _warn_for_nonsequence(tup)
--> 283     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    284
    285

ValueError: need at least one array to concatenate
End of error msg
------------------------------------

The directory structure is as follows:
------------------------------------
data_test/
├── A
│   ├── A_1.csv
│   └── A_2.csv
├── B
│   ├── B_1.csv
│   └── B_2.csv
├── C
│   ├── C_1.csv
│   └── C_2.csv
└── D
    ├── D_1.csv
    └── D_2.csv

Each csv file contains beta chain information:

AAACCTGCAGGCTGAA-1,CASSIRDTETLYF,498,TRBV16,TRBD1,TRBJ2-3,1
AAACGGGAGGGTGTGT-1,CASGEGQTNSDYTF,568,TRBV13-2,TRBD1,TRBJ1-2,5
AAACGGGGTCTTCAAG-1,CASSGQNQDTQYF,503,TRBV15,TRBD1,TRBJ2-5,1
AAACGGGTCTAACTGG-1,CASSLGWHSYEQYF,572,TRBV16,None,TRBJ2-7,3
AAAGATGAGAATTGTG-1,CASGPGQSNTEVFF,527,TRBV13-2,TRBD1,TRBJ1-1,7
AAAGCAATCTGGCGAC-1,CASSDGLGGLEQYF,481,TRBV13-1,TRBD2,TRBJ2-7,7
AAATGCCCAATCCAAC-1,CAWVDWAQNTLYF,544,TRBV31,TRBD2,TRBJ2-4,3
AAATGCCTCGGCTTGG-1,CSAQGAHTEVFF,566,TRBV1,TRBD1,TRBJ1-1,18
AACACGTGTATAATGG-1,CASSSPLAGQDTQYF,519,TRBV3,None,TRBJ2-5,1

Number of records for each input file:

 808 data_test/A/A_1.csv
1920 data_test/A/A_2.csv
2163 data_test/B/B_1.csv
1879 data_test/B/B_2.csv
 836 data_test/C/C_1.csv
1182 data_test/C/C_2.csv
1705 data_test/D/D_1.csv
2091 data_test/D/D_2.csv

sidhomj commented 5 years ago

I think the problem is that your data set has only 8 samples, so the test size is too small. The test_size fraction is the share of samples used for the validation and test sets combined. With it set to 0.2, that is 1.6 samples split across 2 sets, which will not work. In this case I would recommend training with LOO = 1, where one sample is used for validation and one for the test set. You can set this parameter in either the Monte Carlo or the k-fold cross-validation. Let me know if this was the issue and I'll add a check to the code that catches this case and alerts the user.
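The arithmetic behind this can be sketched in a few lines. This is a minimal illustration, not DeepTCR's actual splitting code: the helper name `split_counts`, the equal division of the hold-out fraction between validation and test, and the round-down rule are all assumptions made for the example.

```python
def split_counts(n_samples, test_size):
    """Return (train, valid, test) sample counts for a hold-out fraction.

    Assumption (not taken from DeepTCR): the hold-out fraction is shared
    equally by the validation and test sets, and fractional samples are
    rounded down.
    """
    holdout = n_samples * test_size   # e.g. 8 * 0.2 = 1.6 samples
    per_set = int(holdout / 2)        # 0.8 rounds down to 0
    return n_samples - 2 * per_set, per_set, per_set


print(split_counts(8, 0.2))  # validation and test sets end up empty
print(split_counts(8, 0.5))  # two samples each for validation and test
```

Under these assumptions, an empty validation set produces no prediction batches, which is consistent with `np.vstack` failing with "need at least one array to concatenate" in the traceback above.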

hejing3283 commented 5 years ago

Thanks for the explanation. I realized that and tried with more data: each label now has 8 samples, and I changed test_size to 0.5, which allows 2 samples each for validation and test. Now I am getting a new error:

err msg start-----------
Traceback (most recent call last):
  File "run_deepTCR_1_main.py", line 84, in
    DTCR_WF.Train()
  File "/Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py", line 3164, in Train
TypeError: unsupported format string passed to list.__format__
err msg end ------------------------------

sidhomj commented 5 years ago

It seems that something incorrect is being passed to the print statement for the output statistics. You said each folder now has 8 csv files?
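The TypeError itself is easy to reproduce in plain Python: it appears whenever a list is passed to a numeric format specifier. The snippet below only illustrates that failure mode. Which value DeepTCR passes at line 3164 is not known from this thread, and the variable name `auc_values` is hypothetical.

```python
# Hypothetical: a list arrives where a single float was expected.
auc_values = [0.71, 0.64]

try:
    print("AUC = {:.5f}".format(auc_values))
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Formatting each element separately works.
formatted = ", ".join("{:.5f}".format(v) for v in auc_values)
print("AUC = " + formatted)
```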

hejing3283 commented 5 years ago

Yes. I added more samples. Now each folder has 8 .csv files, in the same format as before.

sidhomj commented 5 years ago

If you send your data, or a part of it, to my email, I might be able to better assess the issue you are having. jsidhom1@jhmi.edu

sidhomj commented 5 years ago

I would also recommend trying this and seeing if it works after you load the data.

DTCR_WF.Monte_Carlo_CrossVal(folds=5,LOO=1)
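For readers unfamiliar with the idea, here is a rough sketch of Monte Carlo cross-validation with one sample held out for validation and one for testing in each fold. It is not DeepTCR's implementation: the function `monte_carlo_loo_splits`, its `seed` argument, and the choice to draw the held-out samples from the whole pool rather than per class are assumptions for illustration.

```python
import random


def monte_carlo_loo_splits(samples, folds, loo=1, seed=0):
    """Yield (train, valid, test) lists for each fold (illustrative only)."""
    rng = random.Random(seed)
    for _ in range(folds):
        shuffled = samples[:]          # copy so the input list is not modified
        rng.shuffle(shuffled)
        valid = shuffled[:loo]         # `loo` samples for validation
        test = shuffled[loo:2 * loo]   # `loo` different samples for testing
        train = shuffled[2 * loo:]     # everything else is used for training
        yield train, valid, test


samples = ["A_1", "A_2", "B_1", "B_2", "C_1", "C_2", "D_1", "D_2"]
splits = list(monte_carlo_loo_splits(samples, folds=5, loo=1))
for train, valid, test in splits:
    print(len(train), valid, test)
```

Because the held-out counts are fixed rather than derived from a fraction, the validation and test sets are never empty, even with only 8 samples.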

hejing3283 commented 5 years ago

Thanks in advance for your help! I am sending you 2 of the 4 directories.

hejing3283 commented 5 years ago

I just tried the MCCV and got a similar error:

err msg start---------------------------------------------------------------
Traceback (most recent call last):
  File "run_deepTCR_1_main.py", line 85, in
    DTCR_WF.Monte_Carlo_CrossVal(folds=5,LOO=1)
  File "/Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py", line 3373, in Monte_Carlo_CrossVal
  File "/Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py", line 3164, in Train
TypeError: unsupported format string passed to list.__format__
err msg end---------------------------------

Also, I was getting warning messages saying that some of the TensorFlow functions are deprecated. I am not sure if this is related.

warning msg start------------------------------------------------------------
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/Layers.py:98: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/Layers.py:99: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/Layers.py:102: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py:3098: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-04-23 11:32:31.648901: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
warning msg end------------------------------------------------------------

sidhomj commented 5 years ago

I just ran the following code and it worked fine.

The TensorFlow deprecation warnings are normal. The code will eventually need to be updated for TensorFlow 2.0, but for now it should work fine.

[image]

hejing3283 commented 5 years ago

The only difference on my side is the order of the Get_Data parameters, but I don't think the call is position sensitive.
I changed it to match, used the same script as you did, uninstalled and reinstalled the package, and it works now!
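On the ordering point: arguments passed by keyword can appear in any order in Python. The function below is a hypothetical stand-in that only records its arguments. It is not DeepTCR's `Get_Data`, and its parameter list and defaults are assumptions chosen to mirror the call earlier in this thread.

```python
def get_data(directory, sep="\t", aa_column_beta=None, count_column=None, n_jobs=1):
    """Hypothetical loader stand-in: return the arguments it received."""
    return {
        "directory": directory,
        "sep": sep,
        "aa_column_beta": aa_column_beta,
        "count_column": count_column,
        "n_jobs": n_jobs,
    }


# Same keyword arguments, different order.
a = get_data(directory="data_test/", sep=",", aa_column_beta=1, count_column=6, n_jobs=2)
b = get_data(n_jobs=2, count_column=6, directory="data_test/", aa_column_beta=1, sep=",")
print(a == b)
```

This suggests the fix came from the reinstall or another script change rather than from reordering the parameters.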

Thanks so much!! Much appreciated!

sidhomj commented 5 years ago

Awesome! I just made some final updates. I would recommend reinstalling the latest version, 1.2.17.

Thanks!

hejing3283 commented 5 years ago

Got you! 👍