tdopierre / ProtAugment

Code for ProtAugment: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning
Apache License 2.0

New dataset generation #6

Open Christoforos00 opened 2 years ago

Christoforos00 commented 2 years ago

Hello,

Thank you for your great paper and repo! I'd like to know the steps that I will need to follow to bring a new dataset in the template of your datasets. For example, how are all the files and folders in the ProtAugment/data/BANKING77 generated from the original dataset?

Thank you.

tdopierre commented 2 years ago

Hi Christoforos,

Thanks for your interest! Glad you liked the paper. To create the different training, validation, and test files, I used a script from another repository of mine, which you can find here: prepate-intent-dataset.py. In short, you need a full.jsonl file containing all annotated samples, where each row is a dictionary with a "label" key. The script then separates the labels found in this full.jsonl file into three sets:

labels.train.txt
labels.valid.txt
labels.test.txt
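
For readers without access to the linked script, here is a minimal sketch of what such a label split could look like. It is not the original script: the function name `split_labels`, the 60/20/20 ratios, and the `seed` parameter are assumptions made for illustration. Only the full.jsonl input with a "label" key and the three output file names come from the description above.

```python
import json
import random
from pathlib import Path


def split_labels(full_path, out_dir, seed=1, ratios=(0.6, 0.2, 0.2)):
    """Split the distinct labels of a full.jsonl file into train/valid/test.

    Hypothetical helper: the ratios and the seed handling are assumptions,
    not the behaviour of the original script.
    """
    # Collect distinct labels; each non-empty line is a JSON object with a "label" key.
    labels = set()
    with open(full_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                labels.add(str(json.loads(line)["label"]))

    # Sort before shuffling so the outcome depends only on the seed.
    labels = sorted(labels)
    random.Random(seed).shuffle(labels)

    n_train = int(len(labels) * ratios[0])
    n_valid = int(len(labels) * ratios[1])
    splits = {
        "train": labels[:n_train],
        "valid": labels[n_train:n_train + n_valid],
        "test": labels[n_train + n_valid:],  # whatever remains
    }

    # Write one label per line to labels.train.txt, labels.valid.txt, labels.test.txt.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        text = "\n".join(split) + "\n"
        (out_dir / f"labels.{name}.txt").write_text(text, encoding="utf-8")
    return splits
```

Note that the split is done over labels, not over samples, so the three sets share no intent classes.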

To create the train.10samples.jsonl file (corresponding to the low data profile), once you have your labels.train.txt, you need to gather 10 random samples for each label it lists. Unfortunately, I can't find the script for this part, but it should not be too complicated.
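
Since the script for this step is missing, here is a rough sketch of how the sampling might be done. The function name `sample_per_label`, its parameters, and the choice to keep all samples for a label with fewer than `k` are assumptions for illustration, not the original implementation.

```python
import json
import random
from collections import defaultdict


def sample_per_label(full_path, labels_path, out_path, k=10, seed=1):
    """Write up to k random samples per training label to out_path.

    Hypothetical helper: returns the number of rows written.
    """
    # Training labels, one per line (e.g. labels.train.txt).
    with open(labels_path, encoding="utf-8") as f:
        train_labels = {line.strip() for line in f if line.strip()}

    # Group the rows of full.jsonl by label, keeping only training labels.
    by_label = defaultdict(list)
    with open(full_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            label = str(row["label"])
            if label in train_labels:
                by_label[label].append(row)

    # A seeded generator makes the sampled file reproducible.
    rng = random.Random(seed)
    n_written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for label in sorted(by_label):
            rows = by_label[label]
            # Assumption: keep every row when a label has fewer than k samples.
            for row in rng.sample(rows, min(k, len(rows))):
                out.write(json.dumps(row, ensure_ascii=False) + "\n")
                n_written += 1
    return n_written
```

A call such as `sample_per_label("full.jsonl", "labels.train.txt", "train.10samples.jsonl")` would then produce the low-data training file, assuming those paths exist.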

Hope that answers your question!

Christoforos00 commented 2 years ago

Great, thank you for your response, I will try the steps you mentioned.

Christoforos00 commented 2 years ago

Also, were the contents of the folders 01, 02, 03, 04, 05 inside BANKING77/few_shot created just by running prepate-intent-dataset.py 5 times?

tdopierre commented 2 years ago

Yes, they were. You might want to use a different fixed seed for each of the 5 runs (e.g. seed in {1, ..., 5}), so that you can regenerate the same splits if you lose the files.
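
To illustrate the seeding advice, here is a small sketch that ties one fixed seed to each run folder. Whether the original script accepts a seed argument is not stated in this thread, so `split_for_seed`, `make_runs`, and the split sizes are hypothetical.

```python
import random


def split_for_seed(labels, seed, n_train, n_valid):
    """Return (train, valid, test) label lists for one seeded run."""
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(labels)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def make_runs(labels, n_train, n_valid, n_runs=5):
    """Map folder names "01".."05" to the split produced with seed 1..5."""
    return {
        f"{seed:02d}": split_for_seed(labels, seed, n_train, n_valid)
        for seed in range(1, n_runs + 1)
    }
```

Because each folder name is tied to its seed, calling `make_runs` again with the same labels and sizes rebuilds the same five splits.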