Open adita15 opened 3 years ago
Can you please particularly point out what part confuses you?
I will write out a generic workflow.
1) Get the dataset. We have not posted the dataset here due to the policy of the Constraint Shared Task; get yourself registered with them to obtain it, and put it in the Dataset folder. The dataset must have two columns: 1) the data, 2) the labels. (We deleted the first row containing the column names and the first column containing serial numbers for each tweet.)
2) If you want to train the models yourself, make a directory named models and run main_multitask_learning.py or main_bin_classification.py.
3) If you wish to use our models, download them into the models folder in this directory.
4) Now you can write your own script to generate results, or use our script at your convenience.
For anything specific, feel free to ask.
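The dataset format described above (a headerless two-column CSV: text first, label second) can be sketched with a minimal loader; the path and function name here are illustrative assumptions, not part of the repo:

```python
import csv

def load_dataset(path):
    """Load a headerless two-column CSV as described above:
    column 0 = the tweet text, column 1 = the label.
    (Assumes the header row and serial-number column were already removed.)"""
    texts, labels = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            texts.append(row[0])
            labels.append(row[1])
    return texts, labels
```

Running this on the prepared file should return equal-length lists of texts and labels; if it raises an IndexError, the serial-number column or header row is probably still present.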
Thank you for prompt response.
These are the points I want to confirm:
Thank you, Aditi Damle.
1) Nope: for training, use the training data; for validation, use the validation data; and generate the CSV on the test data.
2) Use 10 epochs for all models.
3) Pass the one you want to generate results for.
4) The baseline model is the one described in the paper by the workshop organisers (https://arxiv.org/abs/2011.03588). For the auxiliary approach, use the file main_multitask_learning.py.
I can see why this might confuse someone; we were naive about the code. For now, use this info, and I will update the repo in a day or two.
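A minimal sketch of the conventions in the answers above; the key names here are illustrative assumptions, not the scripts' actual CLI arguments:

```python
# Hedged sketch of the training convention described above.
# File paths and key names are assumed placeholders, not the repo's real flags.
TRAIN_CONFIG = {
    "train_file": "Dataset/train.csv",  # fit on the training data
    "valid_file": "Dataset/valid.csv",  # validate on the validation data
    "test_file": "Dataset/test.csv",    # generate the results CSV here
    "epochs": 10,                       # 10 epochs for all models (per the answer above)
}
```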
Thank you for this response. Could you just confirm the parameters passed to the classifier in place of model_path (ai4bharat/indic-bert) or anything else? This confusion arises mainly because some of the runs were experiments and only one was your selected approach. We just want to reproduce the exact results.
Also, if I wish to use a pretrained model from your model files, where do I specify that while fine-tuning using main_multitask_learning.py?
1) Yes, our best results were obtained on the ai4bharat/indic-bert using the auxiliary approach.
2) The binaries we released are already fine-tuned on the workshop dataset. You can either fine-tune the original ai4bharat/indic-bert model on the same dataset and reproduce the models we have released, or just use our released models to generate results on the test set directly. There is no point in fine-tuning our model on the same dataset.
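When checking whether reproduced numbers match the released models, the comparison comes down to an F1 score on the test set. A generic sketch using scikit-learn (this is not the repo's evaluation script, just the standard metric computation):

```python
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred):
    """Weighted F1 over the predicted labels; the shared task reports
    F1 scores, so this is the natural check when reproducing results."""
    return f1_score(y_true, y_pred, average="weighted")
```

Feeding in the gold test labels and the labels from the generated CSV gives a single number to compare against the paper's tables.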
Anything else? Should I close it?
Hi Ojasv,
We really appreciate your responses and are sorry to trouble you so much. Your responses so far have clarified most of our doubts. We just have two final queries.
I am attaching a screenshot (SS) of the results we got using the pre-trained model files. Kindly let us know if these look correct.
I am okay with closing the issue, and I hope I can ask you more questions in the future regarding this problem of hate speech detection. It's really amazing work!
Thank you once again, Aditi Damle.
1) Actually, we did a very sloppy job: we combined the train and valid data (in the CSV) and passed the split parameter accordingly. For now, please bear with us; I have noted this and will fix it very soon.
2) Can you please elaborate? Are you talking about the baseline paper?
It's alright, we understand, as we are also students!
Here is the SS. The baseline model in the paper mentions using an SVM. My understanding is that you train a linear SVM on the data and then compare it against your auxiliary approach to show the improvement. In that case, the data used here to fit the SVM model is just the test set, so I am not clear about the purpose of this SVM code in generate_csv.py.
Best, Aditi
Ok, our bad again! We actually tried ensembling in the generate_csv code, which did not work out for us. This is not the baseline implementation but result generation with ensembling; we forgot to delete the code. Thanks for pointing it out.
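For contrast with the leftover ensembling code, a correct SVM baseline would fit on the training split only and be evaluated on held-out data. A generic TF-IDF plus linear-SVM sketch, assumed rather than taken from the organisers' implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_svm_baseline(train_texts, train_labels):
    """Fit a TF-IDF + linear-SVM baseline on the TRAINING data only;
    the test set should only ever be passed to .predict()."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(train_texts, train_labels)
    return model
```

The key point is the separation: .fit() never sees the test set, which is what made the SVM code inside generate_csv.py look wrong.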
No problem!
So I will just delete all the SVM code for now.
For the baseline, I referred to the paper. I want to ask whether you have implemented it with this dataset in your repo?
I tried fine-tuning using your script, but I am still not able to reproduce the results; the F1 scores lag by about 2 points on all tasks.
I am also facing a similar issue. Are we supposed to train the model with batch size 16? The current version of the code uses batch_size=8. Also, the pre-trained models do not give identical results when running generate_csv.py. Could you please help me with this? FYI, I am trying to reproduce the results for AUX Indic-BERT. Here are the results obtained after running main_multitask_learning.py to train/fine-tune the model.
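Non-identical results across runs are often just unseeded randomness (weight initialization, shuffling, dropout). A generic seeding sketch, assuming the training scripts use PyTorch (the torch calls are guarded since that is an assumption):

```python
import random

import numpy as np

def set_seed(seed=42):
    """Seed the Python and NumPy RNGs, and torch's if it is installed.
    The seed value 42 is an arbitrary placeholder, not the repo's setting."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # assumed dependency of the training scripts
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
```

Even with identical seeds, results can still drift across library versions and GPUs, so small differences from the published numbers are not unusual; the batch-size question (8 vs. 16) is separate and only the authors can confirm the value used.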
I am trying to reproduce these results and am quite confused about the repo's structure. Could you provide detailed setup instructions?