wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

Multilingual `prepare.sh` throws an error after downloading #26

Closed. gchhablani closed this issue 2 years ago

gchhablani commented 2 years ago

While running prepare.sh, the following errors are thrown for all the languages in the multilingual directory:

FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/valid.php-en_XX.php'
Traceback (most recent call last):
  File "encode.py", line 92, in <module>
    main()
  File "encode.py", line 88, in main
    process(args)
  File "encode.py", line 49, in process
    with open(args.input_source, 'r', encoding='utf-8') as f1, \
FileNotFoundError: [Errno 2] No such file or directory: '/home/crocoder/Desktop/transformers/PLBART/multilingual/data/processed/test.php-en_XX.php'

Any help on this?

wasiahmad commented 2 years ago

@gchhablani I am not sure. When we run prepare.sh, we run the following python script first (https://github.com/wasiahmad/PLBART/blob/main/multilingual/data/prepare.sh#L60).

PYTHONPATH=${HOME_DIR} python process.py;

This script creates files named [train|valid|test].LANG-en_XX.LANG under the data/processed directory. It's odd that you are getting this error, because process.py should complain first if there is an issue.
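
The naming scheme described above can be checked before encode.py runs. The following is a minimal sketch, not part of the repository: the helper names, the language list (taken from the directory listings in this thread), and the `processed` path are assumptions, and it only checks the `[split].LANG-en_XX.LANG` files named in the error message.

```python
from pathlib import Path

# Assumed language and split names; adjust them to match your checkout.
LANGS = ("go", "java", "javascript", "php", "python", "ruby")
SPLITS = ("train", "valid", "test")


def expected_processed_files(processed_dir, langs=LANGS, splits=SPLITS):
    """Return the paths that follow the [split].LANG-en_XX.LANG pattern."""
    root = Path(processed_dir)
    return [root / f"{split}.{lang}-en_XX.{lang}"
            for lang in langs for split in splits]


def missing_processed_files(processed_dir, langs=LANGS, splits=SPLITS):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_processed_files(processed_dir, langs, splits)
            if not p.is_file()]


if __name__ == "__main__":
    # "processed" is a hypothetical path relative to the data directory.
    for path in missing_processed_files("processed"):
        print(f"missing: {path}")
```

If this prints every expected file, the step that should have written them most likely failed earlier, and its output is the place to look.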

gchhablani commented 2 years ago

@wasiahmad Thanks! I will recheck the order in which things are run and get back to you. My processed directory is empty after running download.sh. I will see what is causing the issue.
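
The thread does not confirm why encode.py ran against an empty processed directory. One possible explanation is that an earlier step failed and the later steps ran anyway. As a rough sketch of checking the run order, the hypothetical `run_steps` helper below runs commands in sequence and stops at the first failure. The commands shown are placeholders, not the real arguments used by prepare.sh.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; stop at the first non-zero exit code."""
    completed = []
    for cmd in steps:
        # check=True raises CalledProcessError when the command fails,
        # so later steps never run against missing inputs.
        subprocess.run(cmd, check=True)
        completed.append(cmd)
    return completed


if __name__ == "__main__":
    # Placeholder commands standing in for the real pipeline steps.
    steps = [
        [sys.executable, "-c", "print('step 1: process')"],
        [sys.executable, "-c", "print('step 2: encode')"],
    ]
    run_steps(steps)
```

With this pattern, the first failing step is the one that surfaces, rather than a later FileNotFoundError.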

wasiahmad commented 2 years ago

This is what I get after completing the downloading step inside the data directory.

.
  |-go
  |  |-valid.jsonl
  |  |-test.txt
  |  |-train.txt
  |  |-valid.txt
  |  |-test.jsonl
  |  |-train.jsonl
  |-python
  |  |-valid.jsonl
  |  |-test.txt
  |  |-train.txt
  |  |-valid.txt
  |  |-test.jsonl
  |  |-train.jsonl
  |-java
  |  |-valid.jsonl
  |  |-test.txt
  |  |-train.txt
  |  |-valid.txt
  |  |-test.jsonl
  |  |-train.jsonl
  |-javascript
  |  |-valid.jsonl
  |  |-test.txt
  |  |-train.txt
  |  |-valid.txt
  |  |-test.jsonl
  |  |-train.jsonl
  |-encode.py
  |-ruby
  |  |-valid.jsonl
  |  |-test.txt
  |  |-train.txt
  |  |-valid.txt
  |  |-test.jsonl
  |  |-train.jsonl
  |-php
  |  |-valid.jsonl
  |  |-test.txt
  |  |-train.txt
  |  |-valid.txt
  |  |-test.jsonl
  |  |-train.jsonl
  |-download.sh
  |-process.py
  |-prepare.sh

gchhablani commented 2 years ago

.
├── download.sh
├── encode.py
├── go
│   ├── test.jsonl
│   ├── test.txt
│   ├── train.jsonl
│   ├── train.txt
│   ├── valid.jsonl
│   └── valid.txt
├── java
│   ├── test.jsonl
│   ├── test.txt
│   ├── train.jsonl
│   ├── train.txt
│   ├── valid.jsonl
│   └── valid.txt
├── javascript
│   ├── test.jsonl
│   ├── test.txt
│   ├── train.jsonl
│   ├── train.txt
│   ├── valid.jsonl
│   └── valid.txt
├── php
│   ├── test.jsonl
│   ├── test.txt
│   ├── train.jsonl
│   ├── train.txt
│   ├── valid.jsonl
│   └── valid.txt
├── prepare.sh
├── processed
├── process.py
├── python
│   ├── test.jsonl
│   ├── test.txt
│   ├── train.jsonl
│   ├── train.txt
│   ├── valid.jsonl
│   └── valid.txt
└── ruby
    ├── test.jsonl
    ├── test.txt
    ├── train.jsonl
    ├── train.txt
    ├── valid.jsonl
    └── valid.txt

wasiahmad commented 2 years ago

Seems ok, do you still encounter the issue?

gchhablani commented 2 years ago

The actual issue was that libclang-6.0 was missing. It is probably a dependency of one of the packages involved.

Doing the following fixed the issue:

sudo apt-get install -y libclang-6.0-dev
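
Whether a libclang shared library is visible on the system can be checked from Python before running the scripts. This is a minimal sketch under assumptions: the candidate names and the `find_libclang` helper are hypothetical, and `ctypes.util.find_library` may miss libraries installed in non-standard directories, so a negative result is a hint rather than proof.

```python
import ctypes.util

# Assumed library names; the exact name depends on the distribution and version.
CANDIDATES = ("clang", "clang-6.0")


def find_libclang(candidates=CANDIDATES, finder=ctypes.util.find_library):
    """Return the first shared-library name the finder resolves, else None."""
    for name in candidates:
        found = finder(name)
        if found:
            return found
    return None


if __name__ == "__main__":
    lib = find_libclang()
    print(f"libclang found: {lib}" if lib else "libclang not found")
```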

gchhablani commented 2 years ago

Closing this issue. Thanks a lot!

wasiahmad commented 2 years ago

Ah, I see. The tokenizer supports C++ too, so libclang is required. Perhaps I should remove it, since none of the downstream tasks involves C++.