A preprocess problem which I can't solve it!

satinewee commented 3 years ago

bash my_preprocess.sh

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/anaconda3/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "JavaExtractor/extract.py", line 22, in ParallelExtractDir
    ExtractFeaturesForDir(args, dir, "")
  File "JavaExtractor/extract.py", line 35, in ExtractFeaturesForDir
    with open(outputFileName, 'a') as outputFile:
IsADirectoryError: [Errno 21] Is a directory: './tmp/feature_extractor7619/'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "JavaExtractor/extract.py", line 95, in <module>
    ExtractFeaturesForDirsList(args, subdirs)
  File "JavaExtractor/extract.py", line 67, in ExtractFeaturesForDirsList
    p.starmap(ParallelExtractDir, zip(itertools.repeat(args), dirs))
  File "/usr/local/anaconda3/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/local/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
IsADirectoryError: [Errno 21] Is a directory: './tmp/feature_extractor7619/'
Finished extracting paths from training set
Creating histograms from the training data
train data file size: 0
val data file size: 0
TEST data file size: 0
subtoken vocab size: 0
node vocab size: 0
target vocab size: 0
File: deepcom.test.raw.txt
Traceback (most recent call last):
  File "preprocess.py", line 115, in <module>
    max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
  File "preprocess.py", line 53, in process_file
    print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

I have read through the existing issues, but none of them describes the same problem as mine, so I need your help.
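The final ZeroDivisionError in the log above appears to be a downstream symptom: every reported data file size is 0, so the average on line 53 of preprocess.py divides by zero. A minimal sketch of a check that could be run before the histogram step follows. The helper names `empty_files` and `safe_average` and the example file names are hypothetical and are not part of code2seq.

```python
import os
import tempfile


def empty_files(paths):
    """Return the paths that are missing or zero bytes long."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


def safe_average(sum_total, total):
    """Average that yields 0.0 instead of raising when nothing was counted."""
    return float(sum_total) / total if total else 0.0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names, used only for this demonstration.
        good = os.path.join(tmp, "example.val.raw.txt")
        bad = os.path.join(tmp, "example.train.raw.txt")
        with open(good, "w") as fh:
            fh.write("some extracted line\n")
        open(bad, "w").close()  # an empty file, like a failed extraction would leave
        print("Empty or missing:", empty_files([good, bad]))
        print("Average over nothing:", safe_average(0, 0))
```

Checking for empty raw files first points at the extraction step as the place to look, rather than at the later division.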

urialon commented 3 years ago

Hi @satinewee, thank you for your interest in code2seq!

Can you please specify your:

Operating system
Python version (python3 --version)
Java version (java --version)
Tensorflow version (python3 -c 'import tensorflow as tf; print(tf.__version__)')

Best, Uri

satinewee commented 3 years ago

1. Ubuntu
2. Python 3.6.6
3. openjdk version 1.8.0_265
4. TensorFlow 1.9.0

Thank you.


urialon commented 3 years ago

Validation and test succeed; the problem is with processing your training set.

Is there anything unusual about your training set or its directory structure? Is it made of several subdirectories that you can try extracting one-by-one, to see which of them is causing the problem?
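The one-by-one check suggested above can be sketched roughly as follows. This is only an illustration: `find_failing_subdirs` and `immediate_subdirs` are hypothetical helpers, and `extract_fn` stands in for whatever call runs the extractor on a single directory (it is not a code2seq function).

```python
import os
import tempfile


def immediate_subdirs(root):
    """Sorted list of the directories directly under root."""
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


def find_failing_subdirs(root, extract_fn):
    """Run extract_fn on each subdirectory; map failing ones to their exception."""
    failures = {}
    for subdir in immediate_subdirs(root):
        try:
            extract_fn(subdir)
        except Exception as exc:  # keep going so every subdirectory gets tried
            failures[subdir] = exc
    return failures


if __name__ == "__main__":
    def fake_extract(path):
        # Stand-in for the real extraction step (an assumption for the demo).
        if path.endswith("broken"):
            raise IsADirectoryError(21, "Is a directory", path)

    with tempfile.TemporaryDirectory() as root:
        for name in ("part1", "part2_broken"):
            os.mkdir(os.path.join(root, name))
        for path, exc in find_failing_subdirs(root, fake_extract).items():
            print(path, "->", repr(exc))
```

Collecting the exceptions instead of stopping at the first one makes it possible to see whether a single subdirectory is at fault or all of them fail the same way.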


satinewee commented 3 years ago


Thanks for your help. Now I have a new problem. In the preprocessing phase, when I input 20,000 lines of data, I get only about 14,000 lines out. The error report is below. How should I solve it?

JavaExtractor.ExtractFeaturesTask.processFile(ExtractFeaturesTask.java:34)
    at JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:27)
    at JavaExtractor.ExtractFeaturesTask.call(ExtractFeaturesTask.java:16)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
java.util.concurrent.ExecutionException: com.github.javaparser.ParseProblemException: Encountered unexpected token: ">" ">"
    at line 3, column 312.

Was expecting one of:

    "!"
    "("
    "+"
    "++"
    "-"
    "--"
    "boolean"
    "byte"
    "char"
    "double"
    "false"
    "float"
    "int"
    "long"
    "new"
    "null"
    "short"
    "super"
    "this"
    "true"
    "void"
    "~"

    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at JavaExtractor.App.lambda$extractDir$3(App.java:59)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at JavaExtractor.App.extractDir(App.java:57)
    at JavaExtractor.App.main(App.java:32)
Caused by: com.github.javaparser.ParseProblemException: Encountered unexpected token: ">" ">"
    at line 3, column 312.

Was expecting one of:

    "!"
    "("
    "+"
    "++"

urialon commented 3 years ago

Hi @satinewee, what do you mean by "I input 20,000 lines of data"? The input is processed method by method: every method in the input becomes a single line in the output.

Some methods are filtered out, for example methods that are empty or too long. In the error you show here, though, the cause is a ParseProblemException, which means the file does not parse because it contains a syntax error.
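To see where the parser gave up, the reported positions can be pulled out of a saved error log. The sketch below assumes the log contains "Caused by: com.github.javaparser.ParseProblemException" entries followed by "at line N, column M", as in the report above; `parse_error_locations` is a hypothetical helper, not part of code2seq or JavaParser. The positions are relative to whatever text the parser was given, which may differ from the original file.

```python
import re

# One "Caused by" entry per failed parse, followed by its reported position.
_PARSE_ERROR = re.compile(
    r"Caused by: com\.github\.javaparser\.ParseProblemException"
    r".*?at line (\d+), column (\d+)",
    re.DOTALL,
)


def parse_error_locations(log_text):
    """Return (line, column) pairs for each parse failure found in the log."""
    return [(int(line), int(col)) for line, col in _PARSE_ERROR.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "java.util.concurrent.ExecutionException: "
        "com.github.javaparser.ParseProblemException: "
        'Encountered unexpected token: ">" ">"\n at line 3, column 312.\n'
        "Caused by: com.github.javaparser.ParseProblemException: "
        'Encountered unexpected token: ">" ">"\n at line 3, column 312.\n'
    )
    locations = parse_error_locations(sample)
    print(len(locations), "parse failure(s):", locations)
```

Counting these entries also gives a rough idea of how many inputs were dropped because they did not parse, as opposed to being filtered for other reasons.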

Best, Uri

satinewee commented 3 years ago

Hi, I have another question; can you help me? Part of my data now parses successfully with the Java parser. Can the parsed result for that part be saved, instead of going straight to feature extraction? If so, where should the code be modified? Thank you.


urialon commented 3 years ago

Hi @satinewee, yes, you can take a look here: https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/FeatureExtractor.java#L50

This is the point in the code where the input file is parsed. Instead of extracting features, you can save it or extract other kinds of features.

I hope it helps! Best, Uri

tech-srl / code2seq

A preprocess problem which I can't solve it! #87