tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License
555 stars, 164 forks

Encountered error of preprocess data #127

Open yingdehuijin opened 2 years ago

yingdehuijin commented 2 years ago

Hi Uri, I am running code2seq on the newest EMSE-DeepCom datasets (https://github.com/xing-hu/EMSE-DeepCom). I followed your suggestions and ran the preprocess.sh script, but I encountered errors on the test, validation, and training sets. error_log.txt and stdout show the following:

b'java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: ">" ">"\n at line 2, column 407.\n\nWas expecting one of:\n\n

The number of examples also dropped: the 20000 test methods decreased to 17060, the 20000 validation methods to 17043, and the 480000 training methods to 380001. Is there something wrong with the datasets? Looking forward to your reply! Wcc

urialon commented 2 years ago

Hi @yingdehuijin , Thank you for your interest in our work!

I don't know if there is anything wrong with this dataset; I have never used it.

However, it does seem that the files there do not parse. Are they raw Java files? Maybe they are in a different format? Our preprocessing pipeline expects raw Java files.

Can you provide a single example from the dataset?

Best, Uri

yingdehuijin commented 2 years ago

Thank you for your reply. A single example from the dataset looks like this:

code: public static DecomposableMatchBuilder1 < Float , Float > caseFloat ( MatchesAny f ) { List < Matcher < Object > > matchers = new ArrayList < > ( ) ; matchers . add ( any ( ) ) ; return new DecomposableMatchBuilder1 < > ( matchers , NUM_ , new PrimitiveFieldExtractor < > ( Float . class ) ) ; }

nl: matches a float .

urialon commented 2 years ago

Is the "nl: matches a float" part of the same file? Our JavaExtractor expects pure Java files and extracts the method names as the labels. You can replace the existing method name (caseFloat in this example) with a unique ID, remove the "nl: matches a float" line, and later replace the unique ID in the processed files with the natural language sequence that you wish to generate.
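A minimal Python sketch of this workaround follows. It is not part of the code2seq repository, and the helper names (make_id, relabel_method, restore_labels) are hypothetical. It assumes the dataset code is whitespace-tokenized as in the example above, that the method name is the token just before the first "(" (a heuristic that fails if an annotation with arguments precedes the signature), and that each processed line starts with the label followed by a space, with target subtokens joined by "|". The IDs use lowercase letters only, on the assumption that the extractor's subtoken splitting could alter digits, underscores, or camelCase; check this against your own extractor output.

```python
import string
from typing import Dict


def make_id(index: int, width: int = 6) -> str:
    """Encode an integer as a lowercase, letters-only identifier (base 26)."""
    letters = []
    for _ in range(width):
        index, rem = divmod(index, 26)
        letters.append(string.ascii_lowercase[rem])
    return "id" + "".join(reversed(letters))


def relabel_method(code: str, unique_id: str) -> str:
    """Replace the method name with unique_id.

    Heuristic (an assumption): the method name is the token that
    immediately precedes the first "(" token.
    """
    tokens = code.split()
    paren = tokens.index("(")  # raises ValueError if there is no "(" token
    if paren == 0:
        raise ValueError("no token before the first '('")
    tokens[paren - 1] = unique_id
    return " ".join(tokens)


def restore_labels(processed_line: str, id_to_nl: Dict[str, str]) -> str:
    """Swap the ID in the first field for the natural-language target.

    Assumes the label is the first space-separated field and that target
    words should be joined with "|". Unknown labels are left unchanged.
    """
    label, sep, contexts = processed_line.partition(" ")
    nl = id_to_nl.get(label)
    if nl is None:
        return processed_line
    return "|".join(nl.split()) + sep + contexts
```

In use, one would call relabel_method on every method before running preprocess.sh while recording a mapping from each ID to its "nl" text, then apply restore_labels to every line of the processed files.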

See also: https://github.com/tech-srl/code2seq/issues/45

Best, Uri

lidiancracy commented 1 year ago

Hello, I encountered the same issue while preprocessing the files. Does the original JAR package handle exceptions, such as skipping files that do not meet the format requirements without preprocessing them? I'm using it to process my own dataset, but it's throwing errors. I'm not sure if it will keep getting stuck there.

urialon commented 1 year ago

Hi @lidiancracy , Thank you for your interest in our work.

The truth is that I don't remember; this code was written about 5 years ago. If you wish to debug it, go ahead: the entire Java code is available in this repo.

But I recommend using newer models such as PolyCoder: https://github.com/VHellendoorn/Code-LMs https://arxiv.org/pdf/2202.13169.pdf

Best, Uri

lidiancracy commented 1 year ago

@urialon Thank you for your timely reply. My .sh file now terminates normally and has produced 4 files with the .c2s extension, so I think the logic in the JAR package is probably fine. By the way, can I continue training an already well-trained model on a new dataset, similar to transfer learning or incremental training? I did not find any relevant information in the README. Did I miss something? Thank you in advance.

lidiancracy commented 1 year ago

Sorry to bother you. I trained the model using the default parameters, but now only the dictionary remains, as shown in the picture. Is this normal? [image]

urialon commented 1 year ago

Yes