tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

how to process my own data? #27

Closed chao276951044 closed 4 years ago

chao276951044 commented 4 years ago

As we can see in preprocess.sh, at line 36:

TRAIN_DATA_FILE=${DATASET_NAME}.train.raw.txt
VAL_DATA_FILE=${DATASET_NAME}.val.raw.txt
TEST_DATA_FILE=${DATASET_NAME}.test.raw.txt

Can you tell me the format of train.raw.txt?

And after processing, is the dataset generated in TRAIN_DIR=my_training_dir?

urialon commented 4 years ago

Hi, thank you for your interest in code2seq!

For a description of the format, see: https://github.com/tech-srl/code2seq/blob/master/README.md#extending-to-other-languages

Alternatively, you can simply run preprocessing on a small directory (e.g., you can run preprocessing on the JavaExtractor code itself) and see the format.
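For a concrete picture of that format, here is a minimal sketch of a parser for one line of a raw file. It assumes the layout the linked README section appears to describe: each line is one example, fields are separated by spaces, the first field is the target label with subtokens joined by "|", and every following field is a context of the form token,path,token whose parts are also "|"-delimited. The names parse_raw_line and Context, and the node names in the sample line, are hypothetical and are not part of code2seq; check the README and a real preprocessed file before relying on this.

```python
from typing import List, NamedTuple, Tuple


class Context(NamedTuple):
    """One path-context: left token, AST path, right token (hypothetical type)."""
    left_subtokens: List[str]
    path_nodes: List[str]
    right_subtokens: List[str]


def parse_raw_line(line: str) -> Tuple[List[str], List[Context]]:
    """Split one raw line into (target subtokens, list of contexts).

    Assumed layout: "<target> <ctx> <ctx> ...", where each <ctx> is
    "<left>,<path>,<right>" and every component uses "|" internally.
    """
    fields = line.strip().split(" ")
    target = fields[0].split("|")
    contexts = []
    for field in fields[1:]:
        if not field:  # skip empty fields caused by repeated spaces
            continue
        left, path, right = field.split(",")
        contexts.append(
            Context(left.split("|"), path.split("|"), right.split("|"))
        )
    return target, contexts


if __name__ == "__main__":
    # Made-up example line; the node names are placeholders, not real AST node types.
    sample = "get|name this,NodeA|NodeB|NodeC,name"
    target, contexts = parse_raw_line(sample)
    print(target)
    print(contexts)
```

Running a parser like this over the first few lines of a generated raw file is a quick way to confirm that your own extractor emits the same shape.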

TRAIN_DIR is the source of the data, not the target. The preprocessed data is written to the "data/${DATASET_NAME}/" directory. See also the comment at the top of "preprocess.sh".

Best, Uri
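To summarize the naming discussed in this thread, the sketch below derives the raw file names and the output directory from a dataset name. It only mirrors the patterns quoted above (the three *.raw.txt names from preprocess.sh and the data/<dataset name>/ output directory). The helper dataset_paths and the name my_dataset are hypothetical; nothing here runs or inspects preprocess.sh itself.

```python
from typing import Dict


def dataset_paths(dataset_name: str) -> Dict[str, str]:
    """Return the names quoted in this thread for a given dataset name.

    Hypothetical helper: it only formats strings and does not touch the disk.
    """
    return {
        # Raw extractor output files, as named in preprocess.sh
        "train_raw": f"{dataset_name}.train.raw.txt",
        "val_raw": f"{dataset_name}.val.raw.txt",
        "test_raw": f"{dataset_name}.test.raw.txt",
        # Directory where the preprocessed data is written
        "output_dir": f"data/{dataset_name}/",
    }


if __name__ == "__main__":
    # "my_dataset" is a placeholder dataset name.
    for key, value in dataset_paths("my_dataset").items():
        print(f"{key}: {value}")
```

TRAIN_DIR does not appear in the result because it is the input: the directory of source files that the extractor reads.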