manueldeprada / Pretraining-T5-PyTorch-Lightning

Collection of scripts to pretrain T5 in unsupervised text, using PyTorch Lightning. CORD-19 pretraining provided as example.

Custom Dataset #1

Open adrian-jaques-b opened 3 years ago

adrian-jaques-b commented 3 years ago

Hello, first of all, thank you for sharing your code. I'm new to NLP and have been looking for a while for a way to train a T5 transformer in an unsupervised manner.

Nevertheless, I find the CORD-19 dataset really confusing, and since I want to use my own textual data I'm trying to adapt your code for my task. Since I want to use a plain .txt file for training, I think I can skip some of the Python files that are used to preprocess the data? I'm writing because I've been running into some parsing errors, so I'm wondering what the "cord19-standard.txt" data actually looks like before it gets fed into your "prepare_dataset.py". Moreover, what does the preprocessed data (the val and test splits) look like?

I would really appreciate an answer and some examples of the data.

Best, Jaqu

manueldeprada commented 3 years ago

Hi, cord19-standard.txt is a plaintext file with one sentence per line. You can have a look at it for a month at the following link: https://nubeusc-my.sharepoint.com/:f:/g/personal/manuel_deprada_rai_usc_es/Ehl8gWaHa_tPs7rOPBZncGABiEE84D7zRxdlzRyvK9as5A?e=Rt0pve
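For illustration, the first lines of such a file could look like this (placeholder sentences, not the actual CORD-19 text):

```text
Coronaviruses are a large family of enveloped RNA viruses.
The outbreak was first reported in December 2019.
Most patients presented with fever and a dry cough.
```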

The preprocessed data is stored in binary format, ready to load with joblib:

![Preprocessed data directory](https://user-images.githubusercontent.com/6536835/127140238-eef14966-0b89-4ae6-b2a5-124b7bc2113b.png)

The JSON file stores the file sizes and which files belong to the train and validation splits:

```json
{
  "dataset_1.jbl": 165736,  "dataset_2.jbl": 181416,  "dataset_3.jbl": 188815,
  "dataset_4.jbl": 215650,  "dataset_5.jbl": 240161,  "dataset_6.jbl": 221957,
  "dataset_7.jbl": 187024,  "dataset_8.jbl": 173697,  "dataset_9.jbl": 180977,
  "dataset_10.jbl": 195714, "dataset_11.jbl": 199769, "dataset_12.jbl": 197790,
  "dataset_13.jbl": 197618, "dataset_14.jbl": 199836, "dataset_15.jbl": 199698,
  "dataset_16.jbl": 193985, "dataset_17.jbl": 195966, "dataset_18.jbl": 201128,
  "dataset_19.jbl": 199224, "dataset_20.jbl": 199995, "dataset_21.jbl": 196050,
  "dataset_22.jbl": 199180, "dataset_23.jbl": 199004, "dataset_24.jbl": 198710,
  "dataset_25.jbl": 204689, "dataset_26.jbl": 204485, "dataset_27.jbl": 201802,
  "dataset_28.jbl": 202529, "dataset_29.jbl": 36610,
  "total_size": 5579215,
  "train": ["dataset_1.jbl", "dataset_2.jbl", "dataset_3.jbl", "dataset_4.jbl",
            "dataset_5.jbl", "dataset_6.jbl", "dataset_7.jbl", "dataset_8.jbl",
            "dataset_9.jbl", "dataset_10.jbl", "dataset_11.jbl", "dataset_12.jbl",
            "dataset_13.jbl", "dataset_14.jbl", "dataset_15.jbl", "dataset_16.jbl",
            "dataset_17.jbl", "dataset_18.jbl", "dataset_19.jbl", "dataset_20.jbl",
            "dataset_21.jbl", "dataset_22.jbl"],
  "valid": ["dataset_23.jbl", "dataset_24.jbl", "dataset_25.jbl", "dataset_26.jbl",
            "dataset_27.jbl", "dataset_28.jbl", "dataset_29.jbl"]
}
```
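If it helps, this is roughly how the metadata and a single dump can be inspected with joblib (a minimal sketch; the exact object stored inside each .jbl file is an assumption, not necessarily what prepare_dataset.py writes):

```python
import json
import joblib

# metadata.json maps each dump file to its size and lists the train/valid splits
with open("metadata.json") as f:
    metadata = json.load(f)

print("train dumps:", metadata["train"])
print("valid dumps:", metadata["valid"])

# Load one dump; the exact contents depend on how prepare_dataset.py
# serialized them (assumed here to be a collection of preprocessed examples).
examples = joblib.load("dataset_1.jbl")
print(type(examples), len(examples))
```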

Hope this helps you!

adrian-jaques-b commented 3 years ago

Hello,

Thank you so much for your answer! The training finally started working!

I still have a few questions about the data preparation that you might be able to answer. I am currently trying to use a very small text file with 54 sentences, and I'm a little unsure about the parameters. I'm not sure what "dumps" are in this context. Moreover, the training set is split into smaller files, but in total it is not much bigger than the validation set (26 sentences for val / 28 sentences for train, so not the expected 20/80 split). Do you have any ideas or suggestions on how I should set these parameters? What can I use as a guideline here, depending on the input text?

You can find the changed code snippets (main() from prepare_dataset.py) and the metadata.json attached as screenshots. Thanks a lot!

Best wishes, Jaqueline



manueldeprada commented 3 years ago

Sorry for the delay, the notification didn't reach me correctly. If I don't answer, you can reach out to me by email.

> Hello,
>
> Thank you so much for your answer! The training finally started working!
>
> I still have a few questions about the data preparation that you might be able to answer. I am currently trying to use a very small text file with 54 sentences, and I'm a little unsure about the parameters. I'm not sure what "dumps" are in this context.

Dumps are the files of prepared data after preprocessing the dataset. They are called that because they are not plaintext data, but serialized, dumped Python objects ready to be ingested by the model. There are multiple ~100 MB files instead of one monstrous GB-sized file so that loading does not bottleneck the pipeline and the data can be loaded dynamically.
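The idea is roughly the following (a simplified sketch of the concept, not the actual prepare_dataset.py code; the function name, chunk size, and file naming are made up for illustration, and the train/valid assignment is omitted here):

```python
import json
import os
import joblib

def write_dumps(examples, out_dir, examples_per_dump=200_000):
    """Serialize the preprocessed examples into several dump files so that no
    single file gets huge and the training pipeline can load them lazily."""
    sizes = {}
    for i in range(0, len(examples), examples_per_dump):
        chunk = examples[i:i + examples_per_dump]
        name = f"dataset_{i // examples_per_dump + 1}.jbl"
        joblib.dump(chunk, os.path.join(out_dir, name))
        sizes[name] = len(chunk)
    # record per-file sizes and the overall total, as in metadata.json
    sizes["total_size"] = sum(sizes.values())
    with open(os.path.join(out_dir, "metadata.json"), "w") as f:
        json.dump(sizes, f, indent=2)
```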

> Moreover, the training set is split into smaller files, but in total it is not much bigger than the validation set (26 sentences for val / 28 sentences for train, so not the expected 20/80 split). Do you have any ideas or suggestions on how I should set these parameters? What can I use as a guideline here, depending on the input text?

Pretraining transformers should be done with lots of data. I would say at least 1,000-2,000 sentences, and even that probably wouldn't be enough, but you could use some data augmentation techniques.

About the train/valid ratio, the script is designed to keep it only roughly, at the granularity of these ~100 MB files: if, for example, you have eleven 100 MB files, 9 go to train and 2 to validation. So you can see how that goes wrong for smaller datasets.
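To illustrate the file-level granularity, here is a sketch of the idea (not the repository's exact code; the function name, rounding rule, and default fraction are assumptions):

```python
def split_dumps(dump_names, valid_fraction=0.2):
    """Assign whole dump files to train/valid; the ratio therefore only holds
    approximately and breaks down when there are very few files."""
    n_valid = max(1, round(len(dump_names) * valid_fraction))
    return {"train": dump_names[:-n_valid], "valid": dump_names[-n_valid:]}

# e.g. with eleven ~100 MB dump files, roughly 9 go to train and 2 to valid
print(split_dumps([f"dataset_{i}.jbl" for i in range(1, 12)]))
```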

> You can find the changed code snippets (main() from prepare_dataset.py) and the metadata.json attached as screenshots. Thanks a lot!

Unfortunately, the screenshots have not made it through the email reply.

Regards and best of luck!