IndexError: list index out of range when training on C dataset

Sohaib90 commented 3 years ago

I am running code2vec on the Juliet test suite of C/C++ files (primarily focusing on C, using astminer by JetBrains to create the code2vec-format data).

I am running into this indexing error: during model.train(), self.filter_impossible_names_fn(top_words) returns an empty list, so indexing it with [0] fails. Is that supposed to happen? Any idea what the problem might be?

The full log and traceback are as follows:

2021-05-21 16:45:41,365 INFO ---------------------------------------------------------------------
2021-05-21 16:45:41,365 INFO ---------------------------------------------------------------------
2021-05-21 16:45:41,365 INFO ---------------------- Creating code2vec model ----------------------
2021-05-21 16:45:41,365 INFO ---------------------------------------------------------------------
2021-05-21 16:45:41,365 INFO ---------------------------------------------------------------------
2021-05-21 16:45:41,365 INFO Checking number of examples ...
2021-05-21 16:45:41,371 INFO Number of train examples: 13701
2021-05-21 16:45:41,396 INFO Number of test examples: 5184
2021-05-21 16:45:41,396 INFO ---------------------------------------------------------------------
2021-05-21 16:45:41,396 INFO ----------------- Configuration - Hyper Parameters ------------------
2021-05-21 16:45:41,397 INFO CODE_VECTOR_SIZE 384
2021-05-21 16:45:41,397 INFO CSV_BUFFER_SIZE 104857600
2021-05-21 16:45:41,397 INFO DEFAULT_EMBEDDINGS_SIZE 128
2021-05-21 16:45:41,397 INFO DL_FRAMEWORK tensorflow
2021-05-21 16:45:41,397 INFO DROPOUT_KEEP_RATE 0.75
2021-05-21 16:45:41,397 INFO EXPORT_CODE_VECTORS False
2021-05-21 16:45:41,397 INFO LOGS_PATH None
2021-05-21 16:45:41,397 INFO MAX_CONTEXTS 200
2021-05-21 16:45:41,398 INFO MAX_PATH_VOCAB_SIZE 911417
2021-05-21 16:45:41,398 INFO MAX_TARGET_VOCAB_SIZE 261245
2021-05-21 16:45:41,398 INFO MAX_TOKEN_VOCAB_SIZE 1301136
2021-05-21 16:45:41,398 INFO MAX_TO_KEEP 10
2021-05-21 16:45:41,398 INFO MODEL_LOAD_PATH None
2021-05-21 16:45:41,398 INFO MODEL_SAVE_PATH models/clang/saved_model
2021-05-21 16:45:41,398 INFO NUM_BATCHES_TO_LOG_PROGRESS 100
2021-05-21 16:45:41,398 INFO NUM_TEST_EXAMPLES 5184
2021-05-21 16:45:41,398 INFO NUM_TRAIN_BATCHES_TO_EVALUATE 1800
2021-05-21 16:45:41,399 INFO NUM_TRAIN_EPOCHS 20
2021-05-21 16:45:41,399 INFO NUM_TRAIN_EXAMPLES 13701
2021-05-21 16:45:41,399 INFO PATH_EMBEDDINGS_SIZE 128
2021-05-21 16:45:41,399 INFO PREDICT False
2021-05-21 16:45:41,399 INFO READER_NUM_PARALLEL_BATCHES 6
2021-05-21 16:45:41,399 INFO RELEASE False
2021-05-21 16:45:41,399 INFO SAVE_EVERY_EPOCHS 1
2021-05-21 16:45:41,399 INFO SAVE_T2V None
2021-05-21 16:45:41,399 INFO SAVE_W2V None
2021-05-21 16:45:41,399 INFO SEPARATE_OOV_AND_PAD False
2021-05-21 16:45:41,399 INFO SHUFFLE_BUFFER_SIZE 10000
2021-05-21 16:45:41,400 INFO TARGET_EMBEDDINGS_SIZE 384
2021-05-21 16:45:41,400 INFO TEST_BATCH_SIZE 1024
2021-05-21 16:45:41,400 INFO TEST_DATA_PATH data/SOEX1/SOEX1.val.c2v
2021-05-21 16:45:41,400 INFO TOKEN_EMBEDDINGS_SIZE 128
2021-05-21 16:45:41,400 INFO TOP_K_WORDS_CONSIDERED_DURING_PREDICTION 10
2021-05-21 16:45:41,400 INFO TRAIN_BATCH_SIZE 1024
2021-05-21 16:45:41,400 INFO TRAIN_DATA_PATH_PREFIX data/SOEX1/SOEX1
2021-05-21 16:45:41,400 INFO USE_TENSORBOARD False
2021-05-21 16:45:41,400 INFO VERBOSE_MODE 1
2021-05-21 16:45:41,400 INFO _Configlogger <Logger code2vec (INFO)>
2021-05-21 16:45:41,401 INFO context_vector_size 384
2021-05-21 16:45:41,401 INFO entire_model_load_path None
2021-05-21 16:45:41,401 INFO entire_model_save_path models/clang/saved_model__entire-model
2021-05-21 16:45:41,401 INFO is_loading False
2021-05-21 16:45:41,401 INFO is_saving True
2021-05-21 16:45:41,401 INFO is_testing True
2021-05-21 16:45:41,401 INFO is_training True
2021-05-21 16:45:41,401 INFO model_load_dir None
2021-05-21 16:45:41,401 INFO model_weights_load_path None
2021-05-21 16:45:41,401 INFO model_weights_save_path models/clang/saved_modelonly-weights
2021-05-21 16:45:41,401 INFO test_steps 6
2021-05-21 16:45:41,402 INFO train_data_path data/SOEX1/SOEX1.train.c2v
2021-05-21 16:45:41,402 INFO train_steps_per_epoch 14
2021-05-21 16:45:41,402 INFO word_freq_dict_path data/SOEX1/SOEX1.dict.c2v
2021-05-21 16:45:41,402 INFO ---------------------------------------------------------------------
2021-05-21 16:45:41,402 INFO Loading word frequencies dictionaries from: data/SOEX1/SOEX1.dict.c2v ...
2021-05-21 16:45:41,591 INFO Done loading word frequencies dictionaries.
2021-05-21 16:45:41,591 INFO Word frequencies dictionaries loaded. Now creating vocabularies.
2021-05-21 16:45:41,593 INFO Created token vocab. size: 1537
2021-05-21 16:45:41,610 INFO Created path vocab. size: 14680
2021-05-21 16:45:41,618 INFO Created target vocab. size: 7329
2021-05-21 16:45:41,756 INFO Done creating code2vec model
2021-05-21 16:45:41,756 INFO Starting training
WARNING:tensorflow:Entity <bound method PathContextReader._map_raw_dataset_row_to_expected_model_input_form of <path_context_reader.PathContextReader object at 0x7f0e0b90f510>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: No module named 'tensorflow_core.estimator'
WARNING:tensorflow:Entity <bound method PathContextReader._filter_input_rows of <path_context_reader.PathContextReader object at 0x7f0e0b90f510>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: No module named 'tensorflow_core.estimator'
WARNING:tensorflow:From /home/sohaib/anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2021-05-21 16:45:43,322 INFO Number of trainable params: 5037952
2021-05-21 16:45:43,322 INFO variable name: model/WORDS_VOCAB:0 -- shape: (1537, 128) -- #params: 196736
2021-05-21 16:45:43,322 INFO variable name: model/TARGET_WORDS_VOCAB:0 -- shape: (7329, 384) -- #params: 2814336
2021-05-21 16:45:43,322 INFO variable name: model/ATTENTION:0 -- shape: (384, 1) -- #params: 384
2021-05-21 16:45:43,322 INFO variable name: model/PATHS_VOCAB:0 -- shape: (14680, 128) -- #params: 1879040
2021-05-21 16:45:43,322 INFO variable name: model/TRANSFORM:0 -- shape: (384, 384) -- #params: 147456
2021-05-21 16:47:03,288 INFO Initalized variables
2021-05-21 16:47:05,370 INFO Started reader...
2021-05-21 16:47:05.853955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-21 16:47:19,621 INFO Saved after 1 epochs in: models/clang/saved_model_iter1
WARNING:tensorflow:Entity <bound method PathContextReader._map_raw_dataset_row_to_expected_model_input_form of <path_context_reader.PathContextReader object at 0x7f0dfc780e90>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: No module named 'tensorflow_core.estimator'
WARNING:tensorflow:Entity <bound method PathContextReader._filter_input_rows of <path_context_reader.PathContextReader object at 0x7f0dfc780e90>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: No module named 'tensorflow_core.estimator'
2021-05-21 16:47:20,281 INFO Starting evaluation
Traceback (most recent call last):
  File "code2vec.py", line 23, in <module>
    model.train()
  File "/home/sohaib/Desktop/master_thesis/code2vec/tensorflow_model.py", line 95, in train
    evaluation_results = self.evaluate()
  File "/home/sohaib/Desktop/master_thesis/code2vec/tensorflow_model.py", line 172, in evaluate
    subtokens_evaluation_metric.update_batch(zip(original_names, top_words))
  File "/home/sohaib/Desktop/master_thesis/code2vec/tensorflow_model.py", line 460, in update_batch
    prediction = self.filter_impossible_names_fn(top_words)[0]
IndexError: list index out of range

urialon commented 3 years ago

Hi @Sohaib90, the names in C probably do not match the pattern that we defined for Java.
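One way to check this hypothesis before changing any code is to count how many target names in the data fail such a pattern. The following is only a rough sketch under stated assumptions: the regular expression is a guess at a Java-oriented pattern (letters and the `|` subtoken separator) and is not copied from common.py, the helper name `count_illegal_targets` is hypothetical, and the sample rows are made up. It relies on the code2vec data format, where the target name is the first space-separated field of each row.

```python
import re
from collections import Counter

# Assumed pattern: letters and the '|' subtoken separator only.
# The real pattern is defined in code2vec's common.py and may differ.
LEGAL_NAME = re.compile(r'^[a-zA-Z|]+$')


def count_illegal_targets(lines):
    """Count target names (first field of each row) that pass or fail the pattern."""
    counts = Counter()
    for line in lines:
        target = line.split(' ', 1)[0].strip()
        if not target:
            continue  # skip blank rows
        counts['legal' if LEGAL_NAME.match(target) else 'illegal'] += 1
    return counts


# Made-up rows in "target token,path,token" form, for illustration only.
sample = [
    'get|name this,P1,name',
    'cwe_121_bad data,P2,buf',
    'good_g2b data,P3,buf',
]
print(count_illegal_targets(sample))
```

To run it on real data, pass an open file instead of the sample list, for example the training file named in the log above (data/SOEX1/SOEX1.train.c2v). If most targets come out as illegal, the filter is likely to reject every prediction.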

I think that you can just comment out this line: https://github.com/tech-srl/code2vec/blob/master/common.py#L128 and replace it with return top_words to cancel this filtering.

Best, Uri
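For readers who want to see why the filter produces the IndexError, here is a minimal, self-contained sketch. It does not reproduce the code in common.py: the pattern, the `<OOV>` placeholder, and the function names `filter_disabled` and `first_prediction` are assumptions made for illustration. It shows a filter that rejects every C-style name, the suggested workaround of returning `top_words` unchanged, and an optional guard that avoids indexing an empty list.

```python
import re

# Assumed stand-in for the Java-oriented name pattern; the real one may differ.
LEGAL_NAME = re.compile(r'^[a-zA-Z|]+$')
OOV = '<OOV>'  # assumed placeholder for the out-of-vocabulary token


def filter_impossible_names(top_words):
    """Keep only predictions that look like legal (Java-style) method names."""
    return [w for w in top_words if w != OOV and LEGAL_NAME.match(w)]


def filter_disabled(top_words):
    """The workaround suggested above: return the predictions unfiltered."""
    return top_words


def first_prediction(top_words, name_filter):
    """Take the top prediction, returning None instead of raising on an empty list."""
    filtered = name_filter(top_words)
    return filtered[0] if filtered else None


# Made-up C-style predictions containing underscores and digits.
c_style = ['cwe_121_bad', 'good_g2b1', '<OOV>']
print(filter_impossible_names(c_style))             # every name is rejected
print(first_prediction(c_style, filter_disabled))   # the unfiltered top prediction
```

Disabling the filter is the simpler change and matches the suggestion above. The guard in `first_prediction` is an alternative that keeps the filter but tolerates an empty result; whether a missing prediction should count as a miss in the evaluation metric is a separate design decision.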

Sohaib90 commented 3 years ago

Seems to be training now. Thank you so much for the prompt reply and all the work you have put into this. Cheers!

estiver-alvarez commented 3 years ago

I have the same issue. How did you fix it?

IndexError: list index out of range when training on C dataset #117