mnamysl / nat-acl2021

MIT License

how can i get the data set <seq_lab_corpus> #1

Open aa452948257 opened 3 years ago

aa452948257 commented 3 years ago

Could you send me the Noisy Sequence Labeling Data Set? I cannot get the right data by following the README.

mnamysl commented 3 years ago

Hi @aa452948257, thank you for your issue report.

Unfortunately, for licensing/copyright reasons, I cannot send you the data set directly. Instead, you need to download the original data set and restore the noisy annotations, following the instructions in README.md.

Which original data set did you use? What exact error message did you get?

aa452948257 commented 3 years ago

Thank you for your reply. What should I do after downloading the original data set? If I run the code directly without any processing, it generates a strange corpus.


mnamysl commented 3 years ago

After downloading the original data set, please move it to the resources/tasks sub-directory. If you downloaded both original data sets, its content should look like this:

tasks/
├── conll_03
│   ├── dev.txt
│   ├── test.txt
│   └── train.txt
└── ud_english
    ├── en_ewt-ud-dev.conllu
    ├── en_ewt-ud-test.conllu
    └── en_ewt-ud-train.conllu
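
Before running the conversion, it may help to confirm that the files are where the listing above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the script assumes it is run from the directory that contains `resources/`.

```python
from pathlib import Path

# Expected files, taken from the directory listing above.
EXPECTED = {
    "conll_03": ["dev.txt", "test.txt", "train.txt"],
    "ud_english": [
        "en_ewt-ud-dev.conllu",
        "en_ewt-ud-test.conllu",
        "en_ewt-ud-train.conllu",
    ],
}


def missing_files(tasks_dir, expected=EXPECTED):
    """Return the relative paths of expected files that are absent under tasks_dir."""
    root = Path(tasks_dir)
    return [
        f"{corpus}/{name}"
        for corpus, names in expected.items()
        for name in names
        if not (root / corpus / name).is_file()
    ]


if __name__ == "__main__":
    # Assumption: the current working directory contains resources/tasks.
    for path in missing_files("resources/tasks"):
        print("missing:", path)
```

If you only downloaded one of the two corpora, the entries for the other one are reported as missing, which is harmless as long as you do not restore that corpus.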

Let's assume that we want to restore the noisy CoNLL data sets. To achieve this, we first need to call the conversion script as follows:

python3 main.py --mode ds_restore --corpus conll03_en

We can validate the checksum by calling:

python3 main.py --mode ds_check --corpus conll03_en

The output should look like this:

...
2021-11-02 16:12:32,714 tess3_01: True
2021-11-02 16:12:32,727 tess4_01: True
2021-11-02 16:12:32,736 tess4_02: True
2021-11-02 16:12:32,744 tess4_03: True
2021-11-02 16:12:32,750 typos: True
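
If one of the lines reports False, a single file can be checked by hand. The sketch below assumes that the check is a plain MD5 comparison against a stored digest, and that the reference file starts with the hex digest (the common md5sum layout). Both points are assumptions, and the helper names are hypothetical rather than functions from main.py.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(data_path, md5_path):
    """Compare a file's MD5 with the digest stored in a reference file.

    Assumption: the reference file starts with the hex digest, optionally
    followed by a file name.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```

The digest is computed over the raw bytes, so two files that look identical in an editor but differ in line endings or encoding produce different digests.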

The conversion results are stored in the resources/conversion/conll03_en_* directories. To use the generated noisy data sets for evaluation, copy the files with the _restored suffix to the resources/tasks folder. After completing these steps, the structure of the resources/tasks directory should look as follows:

test/resources/tasks/
├── conll_03
│   ├── dev.txt
│   ├── test.txt
│   └── train.txt
├── conll03_en_tess3_01
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_01
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_02
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_03
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
├── conll03_en_tess4_typos
│   ├── dev_restored.txt
│   ├── test_restored.txt
│   └── train_restored.txt
└── ud_english
    ├── en_ewt-ud-dev.conllu
    ├── en_ewt-ud-test.conllu
    └── en_ewt-ud-train.conllu
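
The copy step can also be scripted. This is only a sketch under assumptions: it supposes that each conversion directory should map to a folder of the same name under resources/tasks, and `copy_restored` is a hypothetical helper, not part of the repository. If a target folder needs a different name than its conversion directory, adjust the destination accordingly.

```python
import shutil
from pathlib import Path


def copy_restored(conversion_dir, tasks_dir, prefix="conll03_en_"):
    """Copy every *_restored.txt file into a same-named folder under tasks_dir."""
    copied = []
    for src_dir in sorted(Path(conversion_dir).glob(prefix + "*")):
        if not src_dir.is_dir():
            continue
        for src in sorted(src_dir.glob("*_restored.txt")):
            # Assumption: the target folder carries the conversion folder's name.
            dst_dir = Path(tasks_dir) / src_dir.name
            dst_dir.mkdir(parents=True, exist_ok=True)
            dst = dst_dir / src.name
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied
```

Calling `copy_restored("resources/conversion", "resources/tasks")` from the project root would then return the list of files it copied; other files in the conversion directories are left where they are.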

I hope it helps :-)

aa452948257 commented 3 years ago

Thanks for your help. I restored the noisy CoNLL data sets successfully, but the checksum validation failed. Specifically, I debugged the code and found that the generated md5_res differs from the value in resources/conversion/conll03_en_tess3_01/*.md5, and then I get the following output:

…
2021-11-05 07:07:30,421 tess3_01: False
2021-11-05 07:07:30,433 tess4_01: False
2021-11-05 07:07:30,446 tess4_02: False
2021-11-05 07:07:30,466 tess4_03: False
2021-11-05 07:07:30,479 typos: False


mnamysl commented 3 years ago

Thank you for your feedback. Does the same problem also occur with the UD English EWT data set?