tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License

Issue with training model on C# dataset #65

Closed anki54 closed 4 years ago

anki54 commented 4 years ago

Hi, first of all, great paper. I am using code2vec on a C# dataset, and I was able to preprocess the data using preprocess_csharp.sh. When I run train.sh, I get an out-of-bounds indexing error that mentions a limit of 200, although none of the reported indices seems to exceed it. I am not sure why.

Appreciate your time.

Here is a complete run log: $ ./train.sh 2020-02-16 02:33:26.525726: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2020-02-16 02:33:26.548006: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1992000000 Hz 2020-02-16 02:33:26.549896: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d8993d7790 executing computations on platform Host. Devices: 2020-02-16 02:33:26.549988: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version 2020-02-16 02:33:26,551 INFO
2020-02-16 02:33:26,552 INFO
2020-02-16 02:33:26,552 INFO --------------------------------------------------------------------- 2020-02-16 02:33:26,552 INFO --------------------------------------------------------------------- 2020-02-16 02:33:26,552 INFO ---------------------- Creating word2vec model ---------------------- 2020-02-16 02:33:26,552 INFO --------------------------------------------------------------------- 2020-02-16 02:33:26,552 INFO --------------------------------------------------------------------- 2020-02-16 02:33:26,552 INFO Checking number of examples ... 2020-02-16 02:33:26,552 INFO Number of train examples: 3878 2020-02-16 02:33:26,553 INFO Number of test examples: 1793 2020-02-16 02:33:26,553 INFO --------------------------------------------------------------------- 2020-02-16 02:33:26,553 INFO ----------------- Configuration - Hyper Parameters ------------------ 2020-02-16 02:33:26,553 INFO CODE_VECTOR_SIZE 384 2020-02-16 02:33:26,553 INFO CSV_BUFFER_SIZE 104857600 2020-02-16 02:33:26,554 INFO DEFAULT_EMBEDDINGS_SIZE 128 2020-02-16 02:33:26,554 INFO DL_FRAMEWORK tensorflow 2020-02-16 02:33:26,554 INFO DROPOUT_KEEP_RATE 0.75 2020-02-16 02:33:26,554 INFO EXPORT_CODE_VECTORS False 2020-02-16 02:33:26,554 INFO LOGS_PATH None 2020-02-16 02:33:26,554 INFO MAX_CONTEXTS 200 2020-02-16 02:33:26,554 INFO MAX_PATH_VOCAB_SIZE 911417 2020-02-16 02:33:26,554 INFO MAX_TARGET_VOCAB_SIZE 261245 2020-02-16 02:33:26,554 INFO MAX_TOKEN_VOCAB_SIZE 1301136 2020-02-16 02:33:26,554 INFO MAX_TO_KEEP 10 2020-02-16 02:33:26,554 INFO MODEL_LOAD_PATH None 2020-02-16 02:33:26,554 INFO MODEL_SAVE_PATH models/sharp/saved_model 2020-02-16 02:33:26,554 INFO NUM_BATCHES_TO_LOG_PROGRESS 100 2020-02-16 02:33:26,554 INFO NUM_TEST_EXAMPLES 1793 2020-02-16 02:33:26,554 INFO NUM_TRAIN_BATCHES_TO_EVALUATE 1800 2020-02-16 02:33:26,554 INFO NUM_TRAIN_EPOCHS 20 2020-02-16 02:33:26,555 INFO NUM_TRAIN_EXAMPLES 3878 2020-02-16 02:33:26,555 INFO PATH_EMBEDDINGS_SIZE 128 2020-02-16 02:33:26,555 INFO PREDICT False 
2020-02-16 02:33:26,555 INFO READER_NUM_PARALLEL_BATCHES 6 2020-02-16 02:33:26,555 INFO RELEASE False 2020-02-16 02:33:26,555 INFO SAVE_EVERY_EPOCHS 1 2020-02-16 02:33:26,555 INFO SAVE_T2V None 2020-02-16 02:33:26,555 INFO SAVE_W2V None 2020-02-16 02:33:26,555 INFO SEPARATE_OOV_AND_PAD False 2020-02-16 02:33:26,555 INFO SHUFFLE_BUFFER_SIZE 10000 2020-02-16 02:33:26,555 INFO TARGET_EMBEDDINGS_SIZE 384 2020-02-16 02:33:26,555 INFO TEST_BATCH_SIZE 1024 2020-02-16 02:33:26,555 INFO TEST_DATA_PATH data/csharp/csharp.val.c2v 2020-02-16 02:33:26,555 INFO TOKEN_EMBEDDINGS_SIZE 128 2020-02-16 02:33:26,555 INFO TOP_K_WORDS_CONSIDERED_DURING_PREDICTION 5 2020-02-16 02:33:26,555 INFO TRAIN_BATCH_SIZE 1024 2020-02-16 02:33:26,555 INFO TRAIN_DATA_PATH_PREFIX data/csharp/csharp 2020-02-16 02:33:26,556 INFO USE_TENSORBOARD False 2020-02-16 02:33:26,556 INFO VERBOSE_MODE 1 2020-02-16 02:33:26,556 INFO _Configlogger <Logger code2vec (INFO)> 2020-02-16 02:33:26,556 INFO context_vector_size 384 2020-02-16 02:33:26,556 INFO entire_model_load_path None 2020-02-16 02:33:26,556 INFO entire_model_save_path models/sharp/saved_model__entire-model 2020-02-16 02:33:26,556 INFO is_loading False 2020-02-16 02:33:26,556 INFO is_saving True 2020-02-16 02:33:26,556 INFO is_testing True 2020-02-16 02:33:26,556 INFO is_training True 2020-02-16 02:33:26,556 INFO model_weights_load_path None 2020-02-16 02:33:26,556 INFO model_weights_save_path models/sharp/saved_modelonly-weights 2020-02-16 02:33:26,556 INFO test_steps 2 2020-02-16 02:33:26,556 INFO train_data_path data/csharp/csharp.train.c2v 2020-02-16 02:33:26,556 INFO train_steps_per_epoch 4 2020-02-16 02:33:26,556 INFO word_freq_dict_path data/csharp/csharp.dict.c2v 2020-02-16 02:33:26,557 INFO --------------------------------------------------------------------- 2020-02-16 02:33:26,557 INFO Loading word frequencies dictionaries from: data/csharp/csharp.dict.c2v ... 2020-02-16 02:33:26,559 INFO Done loading word frequencies dictionaries. 
2020-02-16 02:33:26,559 INFO Word frequencies dictionaries loaded. Now creating vocabularies. 2020-02-16 02:33:26,560 INFO Created token vocab. size: 313 2020-02-16 02:33:26,563 INFO Created path vocab. size: 4549 2020-02-16 02:33:26,564 INFO Created target vocab. size: 4 2020-02-16 02:33:26,569 INFO Done creating code2vec model 2020-02-16 02:33:26,570 INFO Starting training (<tf.Tensor 'IteratorGetNext:0' shape=(None,) dtype=int32>, <tf.Tensor 'IteratorGetNext:1' shape=(None, 200) dtype=int32>, <tf.Tensor 'IteratorGetNext:2' shape=(None, 200) dtype=int32>, <tf.Tensor 'IteratorGetNext:3' shape=(None, 200) dtype=int32>, <tf.Tensor 'IteratorGetNext:4' shape=(None, 200) dtype=float32>) WARNING: Logging before flag parsing goes to stderr. W0216 02:33:27.531919 140189473613632 deprecation.py:506] From /home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. Instructions for updating: If using Keras pass _constraint arguments to layers. 2020-02-16 02:33:27,780 INFO Number of trainable params: 771712 2020-02-16 02:33:27,781 INFO variable name: model/WORDS_VOCAB:0 -- shape: (313, 128) -- #params: 40064 2020-02-16 02:33:27,781 INFO variable name: model/TARGET_WORDS_VOCAB:0 -- shape: (4, 384) -- #params: 1536 2020-02-16 02:33:27,781 INFO variable name: model/ATTENTION:0 -- shape: (384, 1) -- #params: 384 2020-02-16 02:33:27,781 INFO variable name: model/PATHS_VOCAB:0 -- shape: (4549, 128) -- #params: 582272 2020-02-16 02:33:27,781 INFO variable name: model/TRANSFORM:0 -- shape: (384, 384) -- #params: 147456 2020-02-16 02:33:27,886 INFO Initalized variables 2020-02-16 02:33:29,033 INFO Started reader... 
2020-02-16 02:33:30.101047: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[78] = [25,3] is out of bounds: need 0 <= index < [200,3]
Traceback (most recent call last):
  File "/home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map_PathContextReader._map_raw_dataset_row_to_expected_model_input_form_481}} indices[78] = [25,3] is out of bounds: need 0 <= index < [200,3]
    [[{{node SparseToDense}}]]
    [[IteratorGetNext]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code2vec.py", line 23, in <module>
    model.train()
  File "/home/anki/anki/github/codevec/code2vec/tensorflow_model.py", line 80, in train
    _, batch_loss = self.sess.run([optimizer, train_loss])
  File "/home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/anki/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[78] = [25,3] is out of bounds: need 0 <= index < [200,3]
    [[{{node SparseToDense}}]]
    [[IteratorGetNext]]
2020-02-16 02:33:30.143602: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[585] = [194,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.158836: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[189] = [62,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.164240: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[399] = [132,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.167973: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[30] = [9,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.208490: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[69] = [22,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.212536: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[171] = [56,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.228089: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[171] = [56,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.267801: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[30] = [9,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.301149: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[414] = [137,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.377550: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[138] = [45,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.408714: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[219] = [72,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.453522: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[66] = [21,3] is out of bounds: need 0 <= index < [200,3]
2020-02-16 02:33:30.460196: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at sparse_to_dense_op.cc:128 : Invalid argument: indices[414] = [137,3] is out of bounds: need 0 <= index < [200,3]

urialon commented 4 years ago

Hi @anki54, thank you for your interest in code2vec, and I'm sorry about these errors.

I suspect that the input data contains commas (","). Let's check whether the temporary text file TRAIN_DATA_FILE=${DATASET_NAME}.train.raw.txt contains unwanted commas. This file is deleted at the end of preprocessing, here: https://github.com/tech-srl/code2vec/blob/master/preprocess_csharp.sh#L70 Can you comment out that line so that the temporary file is not deleted, re-run preprocess_csharp.sh, and check whether MY_DATA.train.raw.txt has extra commas by running the following (replace MY_DATA with the right file name):

cat MY_DATA.train.raw.txt | cut -d' ' -f2- | tr ' ' '\n' | awk -F',' 'NF > 3'

Let me know if running this produces anything. Best, Uri
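
For anyone who prefers to run this check from Python, here is a minimal sketch of the same idea. It is not part of code2vec. It assumes each line of the file is a target label followed by space-separated contexts of the form token,path,token, and the name find_malformed_contexts and the sample lines are made up for illustration. Unlike the awk filter, it flags any context whose field count is not exactly three. A context with a fourth comma-separated field is plausibly what produces column index 3 in errors such as "indices[78] = [25,3] is out of bounds: need 0 <= index < [200,3]".

```python
from typing import Iterable, Iterator, Tuple


def find_malformed_contexts(
    lines: Iterable[str], expected_fields: int = 3
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, context) for every context that does not have
    exactly `expected_fields` comma-separated parts."""
    for line_number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split(" ")
        # parts[0] is assumed to be the target label; the rest are contexts.
        for context in parts[1:]:
            if not context:
                continue  # empty strings come from space padding
            if len(context.split(",")) != expected_fields:
                yield line_number, context


# Made-up sample lines, not taken from a real dataset.
sample_lines = [
    "run a,123,b c,456,d\n",
    "connect a,123,b c,456,d,e\n",
]
for line_number, context in find_malformed_contexts(sample_lines):
    print(f"line {line_number}: {context}")
```

To scan a real file, pass an open file handle instead of the sample list, for example `find_malformed_contexts(open("MY_DATA.train.raw.txt", encoding="utf-8", errors="replace"))`.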

anki54 commented 4 years ago

Thanks Uri. Here is the output from the command:

cat csharp.train.raw.txt | cut -d' ' -f2- | tr ' ' '\n' | awk -F',' 'NF > 3'

i,-1332631111ch,-322067902,noname db|connection,-1ole,55590853,write|line string,634449463,get|invalid|path,db|connection merchantabilityC|fitness|for|a|particular,COMMENT,merchantabilityC|fit,METHOD_NAME be|liable|to|any|party,COMMENT,be|liabl,1305740987,oracle|data|reader query,183d,1277539432,SPACE oracle|data|reader,-1747720328,ex48,e directory|services,14r,-189120134,writer u|C|in|parameter|sink,COMMENT,u|C|in|parameteer,-1031775743,METHOD_NAME 4,-188651100owing|three|paragraphs|appear|in,COMMENT,following|three|paragraphs|appear|in tainted,1577t,-189120134,document escape,-15391760mmand,1753799616,command software|and|its|documentation|for,COMMENT,smaintenanceC|supportC,COMMENT,obligation|to|provide|maintenanceC|supportC serverlocalhostuidsql|userpasswordsql|passworddatabasedbname,635959109,connect13,gt xml|node,-1641738437o|any|party,COMMENT,be|liable|to|any|party tainted,-5988936,METHlt,-41651315,find|all tainted,362962554,t3946,string string|builder,-209083712tory|entry,-112431143,METHOD_NAME text,-1362574111,METHOD_n|result,-153374817,result|search start|info,1713674598,argu,string tainted,1253936268,METHO025060,METHOD_NAME string,5doracle|userpasswordoracle|password,-1346018073,connection|string escape,-1360828253,ap77,METHOD_NAME i,5r|idpostgre|userpasswordpostgre|passworddatabasedbname,-165306176,connection|string str|result,-2123218456,str|resion,-543271287,e string,-15877137r,1233793853,match process,5tainted,1490817555,tainted str|connect,-763543able|to|any|party,COMMENT,be|liable|to|any|party connection|string,pear|in,COMMENT,following|three|paragraphs|appear|in process,1668479357,c|cat|tmptaint|builder,1084642933,METHOD_NAME software|and|its|documentationC|even,COMMENT,software|anend,1227716947,SPACE checked|data,746504430,parao|any|party,COMMENT,be|liable|to|any|party copyright|bertrand|stivalet|permission|is,COMMENT,copyright|bertrand|stivhardcoded|string|input|filtering|check,COMMENT,hardcoded|string|input|filtering|check 
node,-1362574111,ME951773,cn|result db|connection,1340068228,,document process,-15227222eplace,-286845252,SPACE writer,506824860,8572,append

urialon commented 4 years ago

Thank you, it seems like a problem in the C# extractor. Can you, by chance, share a minimal code snippet or file in which the above command prints such corrupted contexts? Maybe a snippet that contains strings from the above output, like "bertrand" or "merchantability"?

anki54 commented 4 years ago

Hi Uri,

A bit odd, but the issue seems to have been caused not by the C# code itself but by the presence of subdirectories. I was training on a test suite from SARD, which had a few levels of subdirectories containing the C# code files. I extracted all the .cs files directly into /train_dir and ran preprocess_csharp.sh; there are no extra commas in the raw file now.
Appreciate your help and quick response.
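
The workaround described above (copying every C# file out of nested subdirectories into one flat directory before preprocessing) could be scripted roughly as follows. This is only a sketch under assumptions: the function name flatten_cs_files is hypothetical, the destination is assumed to lie outside the source tree, and prefixing each file with its relative directory path is a choice made here so that same-named files do not overwrite one another.

```python
import shutil
from pathlib import Path


def flatten_cs_files(source_root: Path, flat_dir: Path) -> int:
    """Copy every .cs file found under source_root (at any depth) into
    flat_dir and return the number of files copied."""
    flat_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    # sorted() materializes the listing before any file is copied.
    for path in sorted(source_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() != ".cs":
            continue
        # Join the relative path parts into one file name, e.g.
        # a/b/Foo.cs -> a__b__Foo.cs, to avoid name collisions.
        flat_name = "__".join(path.relative_to(source_root).parts)
        shutil.copy2(path, flat_dir / flat_name)
        copied += 1
    return copied
```

A call such as `flatten_cs_files(Path("sard_suite"), Path("train_dir"))` (both paths are placeholders) would then leave a flat directory that can be given to preprocess_csharp.sh.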

urialon commented 4 years ago

Thanks for letting me know. I still consider this a bug, but I am happy that you managed to solve it. Good luck, and let me know if you have any other questions.

By the way, you might also want to check out our code2seq paper. Even for "single label" prediction tasks (i.e., not generating sequences), code2seq is better because it has a much better encoder.

shreyasingh commented 4 years ago

Hi all, I would like to train the code2vec model on a C# dataset. Thanks for the detailed README; I understood it clearly. I wanted to ask whether there is an existing dataset of unprocessed C# files that I could feed directly to preprocess_csharp.sh, or whether preprocessed C# data files exist that I could train the model on directly. Secondly, were the token vectors that are available for download trained only on Java tokens, or do they cover C# tokens too? If someone could point me to a token vectors file for C# tokens, that would be great, as I wouldn't have to go through the training phase :)

urialon commented 4 years ago

Hi, You can use the C# data from this paper: https://miltos.allamanis.com/publications/2018learning/

The token vectors were trained on Java. They might be useful in C# as well.

Best, Uri
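
One rough way to judge how useful the Java-trained token vectors might be for C# is to measure what fraction of a C# token vocabulary they contain. The sketch below assumes the vectors are available in the plain word2vec text format (a header line with vocabulary size and dimension, then one token and its values per line). That format, the function names, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def load_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text format: a 'vocab_size dim' header line,
    then one 'token v1 v2 ... v_dim' line per token."""
    iterator = iter(lines)
    dim = int(next(iterator).split()[1])
    vectors: Dict[str, List[float]] = {}
    for line in iterator:
        fields = line.split()
        if len(fields) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


def coverage(tokens: Iterable[str], vectors: Dict[str, List[float]]) -> float:
    """Return the fraction of `tokens` that have a vector."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    return sum(1 for token in token_list if token in vectors) / len(token_list)


# Made-up example: two Java-trained vectors and three C# tokens.
java_vectors = load_word2vec_text(["2 3\n", "get 0.1 0.2 0.3\n", "name 1 2 3\n"])
print(coverage(["get", "name", "connection"], java_vectors))
```

A low coverage figure would suggest training on C# data rather than reusing the Java vectors.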