Training on the small dataset

tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"

http://code2seq.org

MIT License


Training on the small dataset #86

Closed: AriYJ closed this issue 3 years ago

AriYJ commented 3 years ago

The current README's instructions for evaluating the pre-trained model require downloading the 125 GB large Java dataset. Because of disk space restrictions, I would like to run a pre-trained model on the small Java dataset mentioned in the paper instead. I see that the small dataset is available for direct download. Are there other configurations I need to change in order to run the following command on the small dataset?

python3 code2seq.py --load models/java-large-model/model_iter52.release --test data/java-large/java-large.test.c2s

Thank you in advance!

AriYJ commented 3 years ago

I tried changing 'large' to 'small' in the download URL of the large model, but I couldn't access the file:

% wget https://s3.amazonaws.com/code2seq/model/java-small/java-small-model.tar.gz
--2021-02-18 01:12:48--  https://s3.amazonaws.com/code2seq/model/java-small/java-small-model.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.94.181
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.94.181|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-02-18 01:12:48 ERROR 403: Forbidden.
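The URL substitution described above can be checked before starting a large download. The following is a minimal sketch, not part of the code2seq repository: `model_url` and `http_status` are hypothetical helper names, and the URL pattern is only assumed from the one address quoted in this thread. S3 commonly answers 403 rather than 404 for an object that does not exist when the bucket does not allow listing, so a 403 here may simply mean the file has not been uploaded.

```python
import urllib.error
import urllib.request


def model_url(name):
    """Build a model archive URL from a dataset name such as "java-small".

    Assumption: all archives follow the pattern of the URL quoted in the thread.
    """
    return f"https://s3.amazonaws.com/code2seq/model/{name}/{name}-model.tar.gz"


def http_status(url, timeout=10):
    """Send a HEAD request and return the HTTP status code without downloading the body."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; the code is still useful.
        return error.code
```

Calling `http_status(model_url("java-small"))` would return 200 if the archive is publicly readable and 403 in the situation shown in the log above.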

urialon commented 3 years ago

Hi @AriYJ, thank you for your interest in code2seq!

I just uploaded a model that was trained on java-small. Can you try again?

wget https://s3.amazonaws.com/code2seq/model/java-small/java-small-model.tar.gz

Best, Uri

AriYJ commented 3 years ago

Thank you, Uri! This is super helpful, and it worked! For anyone who wants to evaluate the trained small model on the small dataset, use the following command:

python3 code2seq.py --load models/java-small/saved_model_iter13 --test data/java-small/java-small.test.c2s
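For readers scripting this step, here is a minimal sketch that wraps the evaluation command quoted above and checks that its inputs exist first. The helper names (`build_eval_command`, `missing_inputs`, `run_evaluation`) are hypothetical and not part of code2seq. It also assumes that the value passed to `--load` is a checkpoint prefix shared by several files, as is usual for TensorFlow checkpoints, and that the script is run from the repository root.

```python
import subprocess
from pathlib import Path

# Paths copied from the command in this thread; adjust them if the
# archives unpack to a different location on your machine.
MODEL_PREFIX = "models/java-small/saved_model_iter13"
TEST_FILE = "data/java-small/java-small.test.c2s"


def build_eval_command(model_prefix, test_file, python="python3"):
    """Return the argument list for evaluating a saved model on a test file."""
    return [python, "code2seq.py", "--load", model_prefix, "--test", test_file]


def missing_inputs(model_prefix, test_file):
    """Return a list of human-readable problems; an empty list means all inputs were found."""
    problems = []
    model_path = Path(model_prefix)
    parent = model_path.parent
    # Assumption: the model is stored as several files sharing this prefix,
    # so look for any file whose name starts with it.
    if not parent.is_dir() or not any(parent.glob(model_path.name + "*")):
        problems.append(f"no model files found for prefix {model_prefix}")
    if not Path(test_file).is_file():
        problems.append(f"test file not found: {test_file}")
    return problems


def run_evaluation(model_prefix=MODEL_PREFIX, test_file=TEST_FILE):
    """Run the evaluation command, failing early if an input is missing."""
    problems = missing_inputs(model_prefix, test_file)
    if problems:
        raise FileNotFoundError("; ".join(problems))
    subprocess.run(build_eval_command(model_prefix, test_file), check=True)
```

Calling `run_evaluation()` from the repository root would then start the same evaluation as the command above, or raise an error that names the missing model or test file.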

Avv22 commented 2 years ago

@AriYJ

Hello,

I have 20k Java files and would like to obtain one embedding vector for each file. I am looking to use the trained model to get embeddings (predictions) for our Java dataset. Should we first run the extractor and then run prediction, for example with the command below?

python3 code2seq.py --load models/java-small/saved_model_iter13 --predict

Could you please guide me on how to do that? Thanks.