tarrade / proj_multilingual_text_classification

Explore multilingual text classification using embeddings, BERT and deep learning architectures
Apache License 2.0

Find imdb-like movie review datasets in FR, DE and IT #22

Closed tarrade closed 4 years ago

vluechinger commented 4 years ago

There might not be any available datasets in French and German for IMDb reviews.

Alternatives for us to use may be:

tarrade commented 4 years ago

I agree. The only thing I found, https://github.com/tchambon/deepfrench, says "The movie reviews have been downloaded from a french imdb-like website and include 11K positives reviews, 11K negatives reviews as well as 51K unlabeled reviews for language model tuning", but these data are not public.

tarrade commented 4 years ago

We didn't find any ready-to-use dataset, so the only solution left was to use the GCP Translate API to translate the documents, but this is quite expensive: 20 CHF per 1 million characters!

Some datasets collected by a Zurich company: tweets in German with 4 sentiment labels (positive, negative, mixed and unknown): https://www.spinningbytes.com/resources/germansentiment/
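
To get a feel for the translation budget mentioned above, here is a rough sketch that estimates the cost of translating a corpus at the quoted rate of 20 per 1 million characters. The function name `estimate_translation_cost` and the corpus size (25,000 reviews of 1,300 characters each) are hypothetical, chosen only for illustration. They are not measurements from this project, and actual API pricing may differ.

```python
def estimate_translation_cost(texts, price_per_million_chars=20.0):
    """Estimate the cost of translating `texts` into ONE target language.

    price_per_million_chars is the rate quoted in this thread (20 per
    1 million characters); treat it as an assumption, not current pricing.
    """
    total_chars = sum(len(text) for text in texts)
    return total_chars / 1_000_000 * price_per_million_chars


# Hypothetical corpus: 25,000 reviews of 1,300 characters each.
reviews = ["x" * 1300] * 25_000

cost_per_language = estimate_translation_cost(reviews)
print(f"Estimated cost per target language: {cost_per_language:.2f}")

# Translating into FR, DE and IT multiplies the cost by three.
print(f"Estimated cost for 3 languages: {3 * cost_per_language:.2f}")
```

Since the cost scales linearly with the character count, translating only a subsample of the reviews would be one way to keep the budget under control.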