shauli-ravfogel / nullspace_projection

MIT License

The amount of pos_pos and neg_pos is less than 100,000 #4

Closed by W-lw 3 years ago

W-lw commented 3 years ago

I ran the script python make_data.py /path/to/downloaded/twitteraae_all /path/to/project/data/processed/X_race X race from this repo and obtained the filtered datasets pos_pos, pos_neg, neg_pos, and neg_neg. The statistics are as follows: [image] However, according to your scripts, each of these sets should contain more than 100,000 examples. What accounts for the difference?

yanaiela commented 3 years ago

Hey,

Which scripts indicate that it contains more than 100K examples? It's been a while, but these numbers seem right. I think that's the reason we ended up using 160K examples in total for training (40K in each sentiment-race pair): https://arxiv.org/pdf/1808.06640.pdf

W-lw commented 3 years ago

Thank you very much for your reply. Because the number 100,000 appears so often in the notebook, I mistakenly assumed that each category was set to 100K. In fact, according to the setup in your paper from this year, the total across the four categories is 100K, so with the ratio set to 0.5, each category has 25K examples. My question has been resolved. Thank you again for your reply.
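
For anyone else reading, the arithmetic above can be sketched in a few lines of Python. This is only an illustration, not code from the repository: the function name `group_sizes` is hypothetical, and it assumes that the total is split evenly between the two sentiment classes and that `ratio` is the share of each sentiment class assigned to pos_pos and neg_neg, with the remainder going to pos_neg and neg_pos.

```python
from typing import Dict


def group_sizes(total: int, ratio: float) -> Dict[str, int]:
    """Split `total` examples into the four groups named in this thread.

    Assumption (not taken from the repository): each sentiment class gets
    half of the total, and `ratio` is the fraction of a sentiment class
    that goes to pos_pos / neg_neg; the rest goes to pos_neg / neg_pos.
    """
    per_sentiment = total // 2
    majority = round(per_sentiment * ratio)
    minority = per_sentiment - majority
    return {
        "pos_pos": majority,
        "pos_neg": minority,
        "neg_pos": minority,
        "neg_neg": majority,
    }


# Balanced setup discussed above: 100K total, ratio 0.5.
print(group_sizes(100_000, 0.5))
```

Under the same assumptions, `group_sizes(160_000, 0.5)` gives 40K per group, which matches the 160K total (40K per sentiment-race pair) mentioned earlier in the thread.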

yanaiela commented 3 years ago

Glad to hear! Sorry, the code in this repository was written in quite a hurry and could be improved. I'm happy to hear it got resolved, though. Let us know if you have further questions!