kevanoullio / apples-to-apples-agent

Apples-to-Apples game with AI agent using various natural language processing and machine learning techniques.
GNU General Public License v3.0

Make a custom subset of the Google News dataset that contains only the A2A green/red apple words #61

Open kevanoullio opened 2 months ago

kevanoullio commented 2 months ago

The Google News dataset is approx. 3.5 GB, but we only have 614 green apples with 3 synonyms each (1,842 synonyms), which gives 2,456 green apple words in total. There are 1,825 red apples; some have multi-word names (on average about one additional word each, so roughly 1,825 more words), and each has a description of about 15 words on average, for a total of about 31,025 red apple words. So we could be loading a dataset of only about 33,481 words instead of 3 million words.
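
A rough sketch of the arithmetic above and of the filtering idea, in plain Python. The function names (`estimate_vocab_size`, `collect_vocabulary`, `filter_vectors`) and the dict-of-vectors representation are made up for illustration and are not the project's actual code. The counts are word totals, so the number of unique words we actually need to keep is likely smaller.

```python
import re


def estimate_vocab_size(green_apples=614, synonyms_per_green=3,
                        red_apples=1825, extra_words_per_red=1,
                        description_words_per_red=15):
    """Reproduce the estimate from this issue (averages are assumptions)."""
    # Each green apple contributes itself plus its synonyms.
    green_words = green_apples * (1 + synonyms_per_green)
    # Each red apple contributes its name, about one extra name word,
    # and about 15 description words.
    red_words = red_apples * (1 + extra_words_per_red + description_words_per_red)
    return green_words, red_words, green_words + red_words


def collect_vocabulary(texts):
    """Collect the unique words from apple names, synonyms, and descriptions.

    Case is kept as-is (an assumption): the Google News vectors are
    case-sensitive, so lowercasing could miss proper nouns.
    """
    vocabulary = set()
    for text in texts:
        vocabulary.update(re.findall(r"[A-Za-z']+", text))
    return vocabulary


def filter_vectors(vectors, vocabulary):
    """Keep only the entries of a word -> vector mapping that we need."""
    return {word: vec for word, vec in vectors.items() if word in vocabulary}


green, red, total = estimate_vocab_size()
print(green, red, total)
```

In practice `vectors` would come from whatever loader we use for the Google News file, and the filtered result would be saved once so the game only ever loads the small subset.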