hitz-zentroa / GoLLIE

Guideline following Large Language Model for Information Extraction
https://hitz-zentroa.github.io/GoLLIE/
Apache License 2.0

generate dataset #2

Closed jmanhype closed 9 months ago

jmanhype commented 9 months ago

Generating the datasets seems to be taking a very long time, and I'm not sure whether it is actually completing:

Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
NcbiDisease-NER-dev: 100%|███████████████████| 924/924 [00:03<00:00, 238.68it/s]
NcbiDisease-NER-test: 100%|██████████████████| 941/941 [00:04<00:00, 196.24it/s]

BC5CDR-NER-train-0: 13%|██▌ | 609/4561 [00:03<00:30, 131.50it/s]
BC5CDR-NER-train-0: 38%|██████▊ | 1715/4561 [00:09<00:19, 143.23it/s]
BC5CDR-NER-train-0: 61%|███████████ | 2796/4561 [00:14<00:07, 227.09it/s]
BC5CDR-NER-test: 27%|█████▌ | 1272/4798 [00:09<00:27, 126.57it/s]
BC5CDR-NER-test: 39%|████████▎ | 1890/4798 [00:14<00:23, 124.91it/s]
BroadTwitter-NER-dev: 59%|█████████▍ | 1188/2002 [00:10<00:07, 115.30it/s]
BroadTwitter-NER-dev: 88%|██████████████ | 1755/2002 [00:15<00:02, 114.49it/s]

BroadTwitter-NER-dev: 100%|████████████████| 2002/2002 [00:17<00:00, 115.12it/s]
WNUT17-NER-test: 100%|██████████████████████| 1287/1287 [00:17<00:00, 71.52it/s]
BC5CDR-NER-train-0: 100%|██████████████████| 4561/4561 [00:22<00:00, 199.49it/s]
WNUT17-NER-train-0: 100%|██████████████████| 3394/3394 [00:25<00:00, 133.29it/s]
BroadTwitter-NER-test: 100%|███████████████| 2002/2002 [00:13<00:00, 146.04it/s]
CoNLL03-NER-test: 100%|█████████████████████| 3453/3453 [00:34<00:00, 99.02it/s]

BC5CDR-NER-train-0: 72%|████████████▉ | 3290/4561 [00:17<00:05, 223.53it/s]
BC5CDR-NER-train-0: 75%|█████████████▌ | 3434/4561 [00:17<00:05, 224.29it/s]
BC5CDR-NER-train-0: 100%|█████████████████▉| 4545/4561 [00:22<00:00, 222.88it/s]
BC5CDR-NER-test: 100%|█████████████████████| 4798/4798 [00:36<00:00, 130.22it/s]
FabNER-NER-dev: 100%|███████████████████████| 2183/2183 [00:46<00:00, 46.57it/s]
BC5CDR-NER-train-24: 100%|█████████████████| 4561/4561 [00:25<00:00, 175.85it/s]
WNUT17-NER-train-24: 100%|█████████████████| 3394/3394 [00:29<00:00, 115.00it/s]
BC5CDR-NER-test: 90%|██████████████████▉ | 4341/4798 [00:34<00:03, 122.35it/s]
BC5CDR-NER-test: 97%|████████████████████▍| 4660/4798 [00:36<00:00, 208.44it/s]
... (more hidden) ...

BC5CDR-NER-train-24: 42%|███████ | 1909/4561 [00:14<00:18, 141.97it/s]
BC5CDR-NER-train-24: 89%|███████████████ | 4048/4561 [00:23<00:02, 228.24it/s]
BC5CDR-NER-train-42: 100%|█████████████████| 4561/4561 [00:20<00:00, 221.22it/s]
BC5CDR-NER-train-42: 28%|████▊ | 1276/4561 [00:05<00:14, 233.07it/sRepo card metadata block was not found. Setting CardData to empty.02, 235.46it/s]
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.ain-0: 46%|█████▍ | 2431/5342 [00:15<00:23, 125.62it/s]
BroadTwitter-NER-train-0: 74%|████████▉ | 3979/5342 [00:28<00:11, 113.66it/s]

BC5CDR-NER-train-42: 100%|████████████████▉| 4559/4561 [00:20<00:00, 228.19it/s]

BroadTwitter-NER-train-0: 79%|█████████▍ | 4219/5342 [00:30<00:10, 112.01it/s]
BroadTwitter-NER-train-0: 100%|████████████| 5342/5342 [00:36<00:00, 146.11it/s]
WNUT17-NER-train-42: 100%|█████████████████| 3394/3394 [00:30<00:00, 111.55it/s]
FabNER-NER-test: 100%|██████████████████████| 2064/2064 [00:37<00:00, 54.63it/s]

BroadTwitter-NER-train-0: 96%|███████████▍| 5114/5342 [00:35<00:01, 197.95it/s]

BC5CDR-NER-dev: 10%|██▎ | 456/4582 [00:03<00:34, 120.66it/s]
CoNLL03-NER-train-0: 100%|███████████████| 14041/14041 [01:33<00:00, 149.84it/s]
WNUT17-NER-dev: 100%|██████████████████████| 1009/1009 [00:08<00:00, 124.45it/s]
BC5CDR-NER-dev: 100%|██████████████████████| 4582/4582 [00:27<00:00, 166.55it/s]
BroadTwitter-NER-train-24: 100%|███████████| 5342/5342 [00:25<00:00, 213.61it/s]
BroadTwitter-NER-train-42: 100%|███████████| 5342/5342 [00:24<00:00, 218.66it/s]
CoNLL03-NER-train-24: 100%|██████████████| 14041/14041 [01:20<00:00, 174.10it/s]
CoNLL03-NER-train-42: 100%|██████████████| 14041/14041 [01:19<00:00, 176.13it/s]
CoNLL03-NER-dev: 100%|█████████████████████| 3250/3250 [00:20<00:00, 156.14it/s]
OntoNotes5-NER-test: 100%|████████████████| 12217/12217 [04:51<00:00, 41.97it/s]
MultiNERD-NER-test: 100%|█████████████████| 32908/32908 [10:59<00:00, 49.93it/s]
NcbiDisease-NER-train-0: 100%|█████████████| 5433/5433 [00:15<00:00, 351.93it/s]
NcbiDisease-NER-train-24: 100%|████████████| 5433/5433 [00:15<00:00, 352.46it/s]
NcbiDisease-NER-train-42: 100%|████████████| 5433/5433 [00:15<00:00, 351.97it/s]

BC5CDR-NER-dev: 78%|█████████████████ | 3554/4582 [00:22<00:05, 194.35it/s]
BC5CDR-NER-dev: 83%|██████████████████▏ | 3781/4582 [00:23<00:03, 221.54it/s]
BC5CDR-NER-dev: 100%|█████████████████████▉| 4569/4582 [00:27<00:00, 211.22it/s]
BroadTwitter-NER-train-24: 81%|████████▉ | 4352/5342 [00:20<00:05, 191.82it/s]
BroadTwitter-NER-train-24: 95%|██████████▍| 5067/5342 [00:23<00:01, 187.31it/s]
BroadTwitter-NER-train-24: 100%|██████████▉| 5327/5342 [00:24<00:00, 205.77it/s]
BroadTwitter-NER-train-42: 100%|███████████| 5342/5342 [00:24<00:00, 202.92it/s]

ikergarcia1996 commented 9 months ago

Hi @jmanhype!

Generating the training dataset involves transforming each example into our code-based template and then applying the noising algorithms. Since we have many datasets, some of which are very large, and the process must be run 3 times (once for each pre-computed epoch), it takes a long time. This is expected. The process uses all available CPU cores; on our server with 64 cores, generating the dataset takes approximately 3 hours. If you have many cores available, the tqdm output can be a little confusing. If you're uncertain whether the process is still running, press Enter several times so that the remaining tqdm progress bars are written on new lines.
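
For readers trying to picture why so many progress bars appear at once, here is a rough sketch of the fan-out described above: one job per (dataset, split, seed), spread over all CPU cores. It is not GoLLIE's actual code. The dataset list, the `build_tasks`, `generate_split` and `generate_all` names, and the reading of the `train-0` / `train-24` / `train-42` suffixes in the log as per-epoch seeds are all assumptions made for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from itertools import product

# Hypothetical values: the real dataset list and seeds live in the GoLLIE repository.
DATASETS = ["CoNLL03", "WNUT17"]
SEEDS = [0, 24, 42]  # assumed: one pre-computed epoch per seed


def build_tasks(datasets, seeds):
    """One train task per (dataset, seed); dev and test are generated once each."""
    tasks = [(name, "train", seed) for name, seed in product(datasets, seeds)]
    tasks += [(name, split, None) for name in datasets for split in ("dev", "test")]
    return tasks


def generate_split(task):
    """Placeholder for the real work (templating plus noising); returns a label only."""
    name, split, seed = task
    suffix = f"-{seed}" if seed is not None else ""
    return f"{name}-NER-{split}{suffix}"


def generate_all(tasks, max_workers=None):
    """Run every task in its own worker process, using all cores by default."""
    with ProcessPoolExecutor(max_workers=max_workers or os.cpu_count()) as pool:
        return list(pool.map(generate_split, tasks))


if __name__ == "__main__":
    for label in generate_all(build_tasks(DATASETS, SEEDS)):
        print(label)
```

With each worker drawing its own tqdm bar, the bars finish out of order and overwrite each other in the terminal, which would explain the interleaved output in the log above.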

If you just want to run the evaluation, you can delete the train_file and dev_file lines in the configs/data_configs files. Generating the test split should be very fast (under 1 hour).
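
If you would rather not edit each config by hand, something like the following sketch could do it. It assumes each file in configs/data_configs is a JSON object with top-level `train_file` and `dev_file` keys, which may not match every config in the repository; `strip_train_dev` and `strip_configs` are hypothetical helper names, not part of GoLLIE.

```python
import json
from pathlib import Path


def strip_train_dev(config: dict) -> dict:
    """Return a copy of a data config without its train_file and dev_file entries."""
    return {k: v for k, v in config.items() if k not in ("train_file", "dev_file")}


def strip_configs(config_dir: str) -> list:
    """Rewrite every *.json file in config_dir in place; return the names of changed files."""
    changed = []
    for path in sorted(Path(config_dir).glob("*.json")):
        # Assumption: each config file holds a single JSON object.
        config = json.loads(path.read_text(encoding="utf-8"))
        stripped = strip_train_dev(config)
        if stripped != config:
            path.write_text(json.dumps(stripped, indent=4) + "\n", encoding="utf-8")
            changed.append(path.name)
    return changed


if __name__ == "__main__":
    print(strip_configs("configs/data_configs"))
```

The files are overwritten in place, so run this on a clean git checkout or a copy of the directory if you want to restore the training configuration later.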