wasiahmad / AVATAR

Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
https://arxiv.org/abs/2108.11590
Creative Commons Attribution Share Alike 4.0 International

Absent new_lines and indentation in python data #5

Closed. nadiinchi closed this issue 1 year ago

nadiinchi commented 2 years ago

Hi!

I downloaded the data from AVATAR/data/data.zip and also with the script AVATAR/data/download.sh, and it seems that many Python functions in the dataset are missing newlines and indentation. For example, CodeForces/421/A/solution1.py:

n, a, b = map(int, input().split())athur = map(int, input().split())alex = map(int, input().split()) total = [1] * n for i in alex:    total[i-1] = 2 print(*total)

or CodeForces/981/A/solution1.py:

s=input()c=len(s)for i in range(len(s)-1,0,-1):    k=s[0:i+1]    if(k!=k[::-1]):        print(c)        exit()    c-=1if(c==1):    print("0")

According to my simple heuristic count, about 50% of the Python functions look like this.

Is there a way to fix it? Thanks in advance for your help!
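The heuristic behind the 50% estimate is not described in this comment. One possible way to reproduce such a count, as a minimal sketch, is to flag every file whose source does not compile as Python 3. The helper names `looks_collapsed` and `collapsed_fraction` are hypothetical, and the check also flags files that fail for unrelated reasons (for example Python 2 syntax), so it gives an upper bound rather than an exact figure.

```python
def looks_collapsed(source: str) -> bool:
    """Return True if `source` does not compile as Python 3.

    Assumption: a file that lost its newlines and indentation almost
    always becomes a syntax error, so compile failure is used as a proxy.
    compile() only parses the code; it never runs it.
    """
    try:
        compile(source, "<sample>", "exec")
    except (SyntaxError, ValueError):
        return True
    return False


def collapsed_fraction(sources) -> float:
    """Return the share of sources flagged by looks_collapsed."""
    sources = list(sources)
    if not sources:
        return 0.0
    return sum(looks_collapsed(s) for s in sources) / len(sources)


# The two samples quoted in the issue, plus one well-formed control.
samples = [
    "n, a, b = map(int, input().split())athur = map(int, input().split())"
    "alex = map(int, input().split()) total = [1] * n "
    "for i in alex:    total[i-1] = 2 print(*total)",
    "s=input()c=len(s)for i in range(len(s)-1,0,-1):    k=s[0:i+1]"
    "    if(k!=k[::-1]):        print(c)        exit()    c-=1"
    'if(c==1):    print("0")',
    'print("ok")\n',
]

print(f"{collapsed_fraction(samples):.2f}")  # 0.67
```

To apply this to the dataset, the same function could be fed the text of each `solution*.py` file read from the extracted archive.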

zfj1998 commented 2 years ago

Same question. Line breaks and indentation are really important for rebuilding the syntax tree.
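To illustrate the point about the syntax tree, the sketch below parses the CodeForces/981/A sample with Python's standard `ast` module, once as it appears in the dataset and once with line breaks restored. The `restored` string is a plausible reconstruction inferred from the leftover runs of spaces, not the original file, and `top_level_nodes` is a hypothetical helper.

```python
import ast

# Sample exactly as quoted in the issue: one line, no newlines.
collapsed = (
    "s=input()c=len(s)for i in range(len(s)-1,0,-1):    k=s[0:i+1]"
    "    if(k!=k[::-1]):        print(c)        exit()    c-=1"
    'if(c==1):    print("0")'
)

# Assumed reconstruction: the line breaks are guessed from the
# remaining indentation widths (4 and 8 spaces).
restored = """\
s=input()
c=len(s)
for i in range(len(s)-1,0,-1):
    k=s[0:i+1]
    if(k!=k[::-1]):
        print(c)
        exit()
    c-=1
if(c==1):
    print("0")
"""


def top_level_nodes(source: str):
    """Return the node type names of top-level statements, or None if unparsable."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    return [type(node).__name__ for node in tree.body]


print(top_level_nodes(collapsed))  # None
print(top_level_nodes(restored))   # ['Assign', 'Assign', 'For', 'If']
```

Without the line breaks the parser cannot even separate the first two statements, so no tree is produced at all.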

wasiahmad commented 1 year ago

We have resolved the issue by re-crawling the dataset. We released the new dataset along with other updates.