turkish-nlp-suite / BeyazPerde-Movie-Reviews

Repo for Turkish movie reviews dataset.
https://www.turkish-nlp-suite.com
Creative Commons Attribution Share Alike 4.0 International

Duplicates #1

Closed: rukiyesk closed this issue 3 months ago

rukiyesk commented 4 months ago

Thank you for providing these excellent datasets. I am currently using the "Vitamin and Supplements", "Beyaz Perde All Movies", and "Beyaz Perde Best Movies" datasets from this repository for a sentiment analysis project. While working with the data, I encountered a potential issue and wanted to ask for your insights.

I first worked on the Vitamins dataset, and everything worked fine. However, after obtaining suspiciously high classification scores on the Beyaz Perde datasets, I suspect they contain duplicates or data leakage. The class distribution I computed matches the class counts reported in the paper, but I found many duplicates in both Beyaz Perde datasets, especially in the "Best Movies" dataset, which has 49,635 overlapping entries.

Are you aware of any issues related to data leakage or duplicates in these datasets? Is there a recommended way to handle these potential issues to ensure the integrity of my analysis?
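For readers who want to reproduce this kind of check, here is a minimal sketch of counting duplicate reviews, dropping them, and finding test rows that also occur in the training split. It is not the maintainers' de-duplication procedure. The function names (`normalize`, `count_duplicates`, `deduplicate`, `leaked`), the `(text, label)` row layout, and the toy rows are all assumptions made for illustration, and it uses only the Python standard library.

```python
import unicodedata
from collections import Counter


def normalize(text):
    """Normalize a review so trivially different copies compare equal."""
    # NFKC folds compatibility characters; casefold and split collapse
    # case differences and runs of whitespace.
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def count_duplicates(texts):
    """Return how many rows repeat an earlier row after normalization."""
    counts = Counter(normalize(t) for t in texts)
    return sum(n - 1 for n in counts.values())


def deduplicate(rows):
    """Keep the first occurrence of each normalized review text."""
    seen = set()
    unique = []
    for text, label in rows:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def leaked(train_rows, test_rows):
    """Return test rows whose text also appears in the training split."""
    train_keys = {normalize(text) for text, _ in train_rows}
    return [(t, y) for t, y in test_rows if normalize(t) in train_keys]


# Hypothetical toy rows, not taken from the Beyaz Perde datasets.
train = [("Harika bir film!", 1), ("harika  bir film!", 1), ("Çok sıkıcıydı.", 0)]
test = [("Harika bir film!", 1), ("Fena değil.", 1)]

print(count_duplicates(text for text, _ in train))  # duplicates inside train
print(len(deduplicate(train)))                      # rows left after dedup
print(leaked(train, test))                          # test rows seen in train
```

De-duplicating the full dataset before splitting it, rather than each split separately, avoids the same review landing in both train and test. Whether two copies with conflicting labels should both be dropped is a separate decision that this sketch does not make.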

Thank you for your time and assistance. I look forward to your response.

DuyguA commented 4 months ago

Hello, I remember de-duplicating the dataset, but most probably I forgot to push it :grin: :grin: . I'm on another task this week; I can most probably push it next Monday. If you want a bigger set, I have another one (not officially published, but it lives on HF) that I checkpointed with a BERT model and that works fine. I can offer that too.

rukiyesk commented 4 months ago

Hello, thanks a lot! The bigger dataset you mentioned sounds exciting, especially since it's been checkpointed with a BERT model. Looking forward to hearing from you :)

DuyguA commented 4 months ago

Sorry for the late answer. I went over the Beyazperde datasets and made some changes, so please have a look now :wink: Additionally, I added a Sinefil dataset if you'd like: https://huggingface.co/datasets/turkish-nlp-suite/sinefil-movie-reviews

The bigger dataset is indeed a sentiment classification benchmark, including movies, customer reviews, and a hate dataset. I'll do a model + datasets release soon, in around 1.5-2 months. You can track the status on LinkedIn; feel free to send an invite :wave: :wave: