W-lw closed this issue 3 years ago.
Hey,
According to which scripts does it contain more than 100K examples? It's been a while, but these numbers seem right. I think that's the reason we ended up using 160K examples in total for training (40K in each sentiment-race pair): https://arxiv.org/pdf/1808.06640.pdf
Thank you very much for your reply. Because the number 100,000 appears so often in the notebook, I mistakenly assumed that each class was set to 100K. In fact, according to the setup in your paper from this year, the total across the four classes is 100K; with the ratio set to 0.5, each class gets 25K. My question has been solved. Thank you again for your reply.
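To make the arithmetic above concrete, here is a minimal sketch of the split: a 100K total budget with a 0.5 sentiment ratio, divided evenly across the four sentiment-race classes. The function name and structure are illustrative, not code from the repo; only the class names follow the repo's naming.

```python
# Hypothetical sketch of the balanced-split arithmetic described above.
# Class names (pos_pos, etc.) follow the repo's naming; the function
# itself is an illustration, not part of the repository.

def per_class_counts(total, sentiment_ratio=0.5):
    """Split a total example budget evenly across the four
    sentiment-race classes implied by the sentiment ratio."""
    pos_total = int(total * sentiment_ratio)  # positive-sentiment budget
    neg_total = total - pos_total             # negative-sentiment budget
    return {
        "pos_pos": pos_total // 2,
        "pos_neg": pos_total // 2,
        "neg_pos": neg_total // 2,
        "neg_neg": neg_total // 2,
    }

print(per_class_counts(100_000, 0.5))  # each class gets 25,000 examples
```

With a 160K total (as used for training in the paper), the same arithmetic gives 40K per pair.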
Glad to hear! Sorry, the code in this repository was written in quite a hurry and could be improved. I'm happy to hear it got resolved, though. Let us know if you have further questions!
I ran the script

```
python make_data.py /path/to/downloaded/twitteraae_all /path/to/project/data/processed/X_race X race
```

from this repo and got the filtered datasets pos_pos, pos_neg, neg_pos, neg_neg. The statistics are as follows: However, according to your scripts, each of these should contain more than 100,000 examples. May I ask what factors account for this?
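For anyone wanting to reproduce the statistics mentioned above, a quick way to check the size of each filtered split is to count lines in the output files. This is a hypothetical helper; the directory layout and file names are assumptions based on the split names in this thread, not something taken from the repo.

```python
# Hypothetical helper: count examples (lines) in each filtered split.
# Assumes one example per line and files named after the splits;
# adjust paths to match your actual output layout.
import os

def split_sizes(data_dir, splits=("pos_pos", "pos_neg", "neg_pos", "neg_neg")):
    sizes = {}
    for name in splits:
        path = os.path.join(data_dir, name)
        with open(path, encoding="utf-8") as f:
            sizes[name] = sum(1 for _ in f)
    return sizes
```

Running it against the processed-data directory should make it easy to compare the observed counts with the 25K-per-class setup discussed above.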