Dataset Collection -- First Round

lujiaying commented 2 years ago

Let's use this issue to store polished datasets. We expect an 80%/5%/15% train/dev/test split. Cloud folder link
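A minimal sketch of how the 80%/5%/15% split could be produced, assuming a plain shuffled split with a fixed seed and rounding remainders assigned to the test set. The function name `split_indices`, the seed, and the rounding rule are illustrative assumptions, not something fixed by this thread.

```python
import random

def split_indices(n_rows, ratios=(0.80, 0.05, 0.15), seed=0):
    """Shuffle row indices and cut them into train/dev/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * ratios[0])
    n_dev = round(n_rows * ratios[1])
    # Whatever is left after train and dev goes to test.
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(1000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 800 50 150
```

For datasets with rare classes, a stratified split may be a better choice than the plain shuffle shown here.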

Dataset Stat

Dataset	#sample (train/dev/test)	#class	#feature	#cate_f	#num_f	#txt_f	#img_f
~~AnimalCrossing_Species~~	255/66/89	2	16	14	0	1	1
~~AnimalCrossing_Gender~~	331/20/62	2	16	14	0	1	1
Pokemon_type_1	719/45/133	18	25	8	13	3	1
Pokemon_type_2
✅Hearthstone_Class	8569/536/1605	14	19	6	7	5	1
✅Hearthstone_Rarity	8568/535/1607	6	19	6	7	5	1
✅Hearthstone_Cost	8568/536/1606	11	19	6	7	5	1
Hearthstone_Set	8566/533/1607	38	19	6	7	5	1
Hearthstone-Minion_Race
Hearthstone-Minion_Attack	5568/348/1043	13	14	4	4	5	1
Hearthstone-Minion_Health	5567/348/1044	13	14	4	4	5	1
Hearthstone-Spell_SpellSchool	2715/170/508	8	13	4	3	5	1

P.S. column definitions:

#sample (train/dev/test): number of rows (samples) in the train, dev, and test sets, e.g. 800/50/150
#class: number of classes to predict, e.g. 2 for binary classification
#feature: number of features per sample (the sum of the four counts below), e.g. 10
#cate_f: number of categorical features, e.g. 3
#num_f: number of numerical features, e.g. 4
#txt_f: number of textual features, e.g. 2
#img_f: number of image features, e.g. 1

class distribution

Can we also add some stats on the class distribution of each built dataset? e.g. train: {'male': 30, 'female': 40}

AnimalCrossing_Gender train:{male:171, female:160} dev:{male:11, female:9} test:{male:32, female:30}
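One way to produce these per-split counts is sketched below. It assumes the labels have already been read into a plain list; the helper name `class_distribution` is hypothetical.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: count} ordered from most to least frequent."""
    return dict(Counter(labels).most_common())

# Toy labels matching the AnimalCrossing_Gender train counts above.
train_labels = ["male"] * 171 + ["female"] * 160
print(class_distribution(train_labels))  # {'male': 171, 'female': 160}
```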

Binary Task Exp Results

Method	Dataset	acc	roc_auc	f1	precision	recall	log_loss
AG-best	AnimalCrossing_Gender	1.0	1.0	1.0	1.0	1.0
AG-medium	AnimalCrossing_Gender	0.98	0.97	0.98	1.0	0.97

Multiclass Task Exp Results

Method	Dataset	acc	balanced_acc	mcc	log_loss
AG-medium	AnimalCrossing_Species	0.06	0.05	0.02
AG-medium-mm	AnimalCrossing_Species	0.04	0.03	-0.002
AG-medium	HearthStone-All-cardClass	0.723	0.433	0.562
AG-medium-mm	HearthStone-All-cardClass	0.744	0.491	0.600
AG-medium	HearthStone-All-rarity	0.763	0.654	0.657	0.621
AG-medium-mm	HearthStone-All-rarity	0.763	0.652	0.658	0.615
AG-medium	HearthStone-All-cost	0.624	0.567	0.569	1.109
AG-medium-mm	HearthStone-All-cost	0.638	0.582	0.585	1.097
AG-medium	HearthStone-All-set	0.465	0.340	0.440	1.811
AG-medium-mm	HearthStone-All-set	0.469	0.348	0.445	1.841
AG-medium	HearthStone-Minions-attack	0.577	0.537	0.499	1.26
AG-medium-mm	HearthStone-Minions-attack	0.559	0.550	0.502	1.263
AG-medium	HearthStone-Minions-health	0.563	0.552	0.513	1.24
AG-medium-mm	HearthStone-Minions-health	0.563	0.552	0.513	1.24
AG-medium	HearthStone-Spells-spellSchool	0.835	0.602	0.653	0.533
AG-medium-mm	HearthStone-Spells-spellSchool	0.827	0.599	0.637	0.523
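The gap between acc and balanced_acc in several rows is what class imbalance typically produces. The sketch below assumes balanced_acc follows the usual definition (the mean of per-class recall) and uses made-up toy labels, not data from these experiments.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example: a model that always predicts the majority class.
y_true = ["common"] * 9 + ["legendary"]
y_pred = ["common"] * 10
print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```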

qyccc3 commented 2 years ago

I have made three CSV files for train, dev and test of the species in Animal Crossing. Species_dev.csv Species_test.csv Species_train.csv

lujiaying commented 2 years ago

From the label distribution, I suspect species prediction will be very challenging. Would you like to run a pilot experiment with AutoGluon to see how it performs? If AutoGluon ends up discarding some rare class labels (it logs a message when that happens), we may need to stick to the gender prediction task instead of species prediction.

We can start with CPU-only models. Something like the following:

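from autogluon.tabular import TabularDataset, TabularPredictor

# Placeholder data from the AutoGluon quick-start; replace with our train/test CSVs.
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
# 'class' is the label column of the placeholder data; use our dataset's label column instead.
predictor = TabularPredictor(label='class').fit(train_data=train_data)
# Predict labels for the held-out test rows.
predictions = predictor.predict(test_data)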
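Before running the pilot, it may help to check which labels are at risk. The sketch below flags labels with few training rows and labels that appear only in the evaluation split. The `min_count=10` threshold is a hypothetical value chosen for illustration, not AutoGluon's actual cutoff, and `find_risky_labels` is not an AutoGluon function.

```python
from collections import Counter

def find_risky_labels(train_labels, eval_labels, min_count=10):
    """Return (rare, unseen): labels with few training rows, and
    evaluation labels that never occur in training."""
    counts = Counter(train_labels)
    rare = sorted(label for label, n in counts.items() if n < min_count)
    unseen = sorted(set(eval_labels) - set(counts))
    return rare, unseen

# Toy species labels; real ones would come from the CSV label column.
train = ["cat"] * 12 + ["dog"] * 11 + ["octopus"] * 2
test = ["cat", "dog", "cow"]
rare, unseen = find_risky_labels(train, test)
print(rare, unseen)  # ['octopus'] ['cow']
```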

qyccc3 commented 2 years ago

Team_dev.csv Team_test.csv Team_train.csv

I have only included the first 4 abilities of each hero, since some heroes have extra abilities. There are 63 heroes with the "good" team attribute and 54 heroes with the "bad" team attribute.

lujiaying commented 2 years ago

So far all the CSV files look great to me. Since we are going to create a benchmark that contains a number of datasets, I'd suggest we use Emory OneDrive to organize the files: cloud folder link

The file structure can be:

./datasets
|-- AnimalCrossing_Gender
    |-- train.csv
    |-- dev.csv
    |-- test.csv
    |-- train_images.zip     
    |-- dev_images.zip
    |-- test_images.zip
|-- Dota2_Team
    |-- train.csv
    |-- dev.csv
    |-- test.csv
    |-- train_images.zip     
    |-- dev_images.zip
    |-- test_images.zip

Special notes for *_images.zip: 1. each *_images.zip is a compressed archive of a folder; 2. please make sure train.csv has a column named image that stores the correct relative path to each image file.

train_images.zip
|-- image_1
|-- image_2
|-- ....
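A quick consistency check between a CSV and its image archive could look like the sketch below. It assumes the image column stores each path exactly as the member name appears inside the zip; if paths are relative to some other folder, the comparison would need adjusting. The helper name `missing_images` is hypothetical.

```python
import csv
import zipfile

def missing_images(csv_path, zip_path, image_column="image"):
    """Return image paths listed in the CSV but absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        members = set(archive.namelist())
    with open(csv_path, newline="", encoding="utf-8") as f:
        paths = [row[image_column] for row in csv.DictReader(f)]
    return [p for p in paths if p not in members]
```

Calling `missing_images("train.csv", "train_images.zip")` should return an empty list when every row points at an image that is actually in the archive.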

@qyccc3 since Dota2 only has 117 heroes, let's focus on generating a polished Animal Crossing dataset this week (Aug 30). Please try to fill in the table in the first comment of this thread (issue).

qyccc3 commented 1 year ago

Heartstone_minion.csv Heartstone_spell.csv These are the Hearthstone minion and spell CSVs, without the empty cells filled in.

I have uploaded Hearthstone_Minion to OneDrive with its images, CSV files, info.txt and trained predictor

lujiaying commented 1 year ago

Heartstone_minion.csv Heartstone_spell.csv These are the Hearthstone minion and spell CSVs, without the empty cells filled in.

Missing columns of Minion

text: I manually checked Fearsome Doomguard, Firecat Form, and Snowflipper Penguin; their descriptions are indeed blank.
race: we can fill null values with None_Race.
TODO: check whether AG can automatically handle columns like text and mechanics. What does AG think these columns are: text or category?

Missing columns of Spell

rarity: according to this url, it is highly likely that the rarity is free. Checked Nerubian Ambush!, Improved Ice Trap, Shadowy Gem, DIE, INSECT!...
Let's leave the health and attack columns blank (keep the cell empty) instead of assigning 0.
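The conventions above can be sketched as a small per-row cleanup. This assumes each card is a dict keyed by column name, as `csv.DictReader` would produce; the helper `clean_card` and the example card are hypothetical, and whether the race placeholder should also apply to non-minion cards is left open.

```python
def clean_card(row):
    """Apply the missing-value conventions to one card record."""
    cleaned = dict(row)
    # Cards without a race get an explicit placeholder category.
    if not cleaned.get("race"):
        cleaned["race"] = "None_Race"
    # health and attack stay blank rather than being set to 0.
    for column in ("health", "attack"):
        if cleaned.get(column) in (None, ""):
            cleaned[column] = ""
    return cleaned

card = {"name": "Example Card", "race": "", "health": None, "attack": "3"}
print(clean_card(card))
# {'name': 'Example Card', 'race': 'None_Race', 'health': '', 'attack': '3'}
```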

qyccc3 commented 1 year ago

Heartstone.csv

qyccc3 commented 1 year ago

pokemon_0421.csv

lujiaying commented 1 year ago

Heartstone.csv

For Hearthstone, let's include the following card types: Minion, Spell, Weapon, Location

lujiaying commented 1 year ago

pokemon_0421.csv

For Pokemon, let's remove the following columns: egg_type_number, egg_type1, egg_type2, type_number, against_normal, against_fire, ..., against_*

These columns directly leak information about a Pokemon's type.
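A minimal sketch of the column removal, assuming rows are dicts keyed by column name and that every matchup column starts with the prefix `against`. The helper names and the example row are hypothetical.

```python
LEAKY_COLUMNS = {"egg_type_number", "egg_type1", "egg_type2", "type_number"}

def is_leaky(column):
    """True for columns that reveal the Pokemon's type."""
    return column in LEAKY_COLUMNS or column.startswith("against")

def drop_leaky_columns(row):
    """Return a copy of the row without the leaky columns."""
    return {k: v for k, v in row.items() if not is_leaky(k)}

row = {"name": "Example", "hp": 45, "egg_type1": "x", "against_fire": 2.0}
print(drop_leaky_columns(row))  # {'name': 'Example', 'hp': 45}
```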

lujiaying commented 1 year ago

Everything has been uploaded to Overleaf.

lujiaying / MUG-Bench