Data Sources & Dataset Sharing

lujiaying commented 2 years ago

From #1,

Data Collection (high-priority): we need to pay extra attention to copyright issues
- pokemon: https://bulbapedia.bulbagarden.net/wiki/Main_Page
- ~animal crossing~: https://docs.google.com/spreadsheets/d/13d_LAJPlxMa_DubPTuirkIV4DERBMXbrWQsmSh8ReK4/edit#gid=1022368750
- ~DOTA~: ~123 heroes; https://github.com/kriskate/dota-data (old); https://dota2.fandom.com/wiki/Table_of_hero_attributes
- ~LOL~: 161 heroes; https://www.op.gg/champions
- HearthStone https://hearthstonejson.com/ (first reference from: https://github.com/deepmind/card2code)
- Magic: https://scryfall.com/docs/api/bulk-data
- Farming Simulator Equipment: https://farmingsimulator.fandom.com/wiki/Equipment/Farming_Simulator_22, price prediction (regression), said to have over 400 pieces of equipment
- Diablo II Items: https://diablo-archive.fandom.com/wiki/Items_(Diablo_II), Quality Level or Level Requirement prediction
- LOL: 1251 skins; http://ddragon.leagueoflegends.com/cdn/12.19.1/data/en_US/champion.json; https://lolskinshop.com/product-category/lol-skins/
- Elder Scrolls items/weapons/armors classification
- Legend of Zelda
- CS:GO skin (weapons): cs-go database

Possible options for sharing Data Set after submission

ScienceDB: no max quota. If there is no license concern, we can go with it.
Figshare: 20GB quota. If our final benchmark is less than 5GB, this one might be better.

lujiaying commented 1 year ago

Dataset Creation Process

For Pokemon, we collected data for a total of 897 Pokémon from https://bulbapedia.bulbagarden.net/wiki/Main_Page. We deleted japanese_name and all attributes related to "against" to create the tasks for type_1 and type_2.
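The column dropping described above can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: it assumes each Pokémon is a dict, that the "against" attributes share an `against` name prefix, and that a missing type_2 is stored as an empty value. The function name `prepare_pokemon_rows` is hypothetical; the `None_Type` placeholder is the one used in the paper draft later in this thread.

```python
def prepare_pokemon_rows(rows, placeholder="None_Type"):
    """Drop japanese_name and all against* columns; fill missing type_2.

    Assumption: each row is a dict and the "against" attributes are named
    with an `against` prefix (e.g. against_fire).
    """
    cleaned = []
    for row in rows:
        kept = {
            key: value
            for key, value in row.items()
            if key != "japanese_name" and not key.startswith("against")
        }
        # Single-type Pokemon get an explicit placeholder class for type_2.
        if not kept.get("type_2"):
            kept["type_2"] = placeholder
        cleaned.append(kept)
    return cleaned
```

The same cleaned rows serve both tasks: type_1 or type_2 is taken as the label and the remaining columns as features.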

For Hearthstone, we collected data for a total of 10710 Hearthstone cards from https://hearthstonejson.com/. We split this dataset into 3 categories: hearthstone_minions, hearthstone_spells and hearthstone_all.

For hearthstone_minions, we have 6776 minions and we kept ["cardClass", "health", "id", "name", "set", "attack", "cost", "rarity", "artist", "collectible", "text", "mechanics", "race", "Image Path"]. We created three tasks based on this: attack, health and race. @qyccc3 How about the race task: what did we do to regroup them?
For hearthstone_spells, we have 3993 spells and kept ["cardClass", "health", "id", "name", "set", "cost", "rarity", "artist", "collectible", "spellSchool", "text", "mechanics", "Image Path"]. We created one task based on this: spellSchool.
For hearthstone_all, we have 10710 Hearthstone cards including Minions, Spells, Weapons and Locations. We kept the columns from the two categories above plus ["durability", "overload", "spellDamage"]. We created four tasks based on this category: cardClass, cost, rarity and set.
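A rough sketch of the minion/spell split described above, for reference. It assumes the cards are already loaded as a list of dicts with a `type` field, and it compares the type case-insensitively because the exact casing in the raw data is not stated here. `select_cards` is a hypothetical helper name, not our actual code; the column lists are copied from this comment.

```python
MINION_COLUMNS = [
    "cardClass", "health", "id", "name", "set", "attack", "cost", "rarity",
    "artist", "collectible", "text", "mechanics", "race", "Image Path",
]
SPELL_COLUMNS = [
    "cardClass", "health", "id", "name", "set", "cost", "rarity", "artist",
    "collectible", "spellSchool", "text", "mechanics", "Image Path",
]


def select_cards(cards, card_type, columns):
    """Keep cards of one type and project them onto the given columns.

    Columns absent from a card are filled with None so every row has
    the same keys.
    """
    selected = []
    for card in cards:
        if str(card.get("type", "")).lower() != card_type.lower():
            continue
        selected.append({column: card.get(column) for column in columns})
    return selected
```

For example, `select_cards(cards, "Minion", MINION_COLUMNS)` would give the rows for hearthstone_minions and `select_cards(cards, "Spell", SPELL_COLUMNS)` those for hearthstone_spells.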

For Lol-skin, we collected data for a total of 1251 skins from http://ddragon.leagueoflegends.com/cdn/12.19.1/data/en_US/champion.json and https://lolskinshop.com/product-category/lol-skins/. We kept ["SkinName", "Category", "Price", "Concept", "Model", "Particles", "Animations", "Sounds", "Release date", "Sold ingame"]. We created one task based on this dataset: category.

lujiaying commented 1 year ago

\subsubsection{\pkm}
\begin{itemize}
  \item Primary Type: There are 18 unique \texttt{type\_1} attribute values. We dropped the column \texttt{japanese\_name} and all attributes related to ``against'' to create the categorical classification task for \texttt{type\_1}.
  \item Secondary Type: There are 18 unique \texttt{type\_2} attribute values. As for the Pokémon Primary Type task, we dropped the column \texttt{japanese\_name} and all attributes related to ``against'' to create the categorical classification task for \texttt{type\_2}. Because some Pokémon have only one type, their \texttt{type\_2} class is set to \texttt{None\_Type}.
\end{itemize}

\subsubsection{Hearthstone}
For Hearthstone, we collected data for a total of 10710 Hearthstone cards from https://hearthstonejson.com/. We split this dataset into 3 categories: Minions, Spells and All.
\begin{itemize}
  \item Minions-race: For Hearthstone Minions, we filtered the dataset by selecting \texttt{Minion} in the \texttt{type} attribute, which yields 6776 minions, and we kept these columns: [\texttt{cardClass}, \texttt{health}, \texttt{id}, \texttt{name}, \texttt{set}, \texttt{attack}, \texttt{cost}, \texttt{rarity}, \texttt{artist}, \texttt{collectible}, \texttt{text}, \texttt{mechanics}, \texttt{race}, \texttt{Image Path}]. We created one task, which is the categorical feature \texttt{race}. We deleted all the \texttt{race} categories that have fewer than 5 test samples. There are 15 unique \texttt{race} attribute values.
  \item Spells-spellSchool: For Hearthstone Spells, we filtered the dataset by selecting \texttt{Spell} in the \texttt{type} attribute, which yields 3993 spells, and we kept these columns: [\texttt{cardClass}, \texttt{health}, \texttt{id}, \texttt{name}, \texttt{set}, \texttt{cost}, \texttt{rarity}, \texttt{artist}, \texttt{collectible}, \texttt{spellSchool}, \texttt{text}, \texttt{mechanics}, \texttt{Image Path}]. We created one task based on this dataset, which is the categorical feature \texttt{spellSchool}. Spells that do not have a \texttt{spellSchool} attribute value were assigned \texttt{NONE\_spellSchool}. There are 8 unique \texttt{spellSchool} attribute values.
  \item All: For Hearthstone All, we combined all the minions and spells as well as locations and weapons. We kept all the columns from the two tasks above plus [\texttt{durability}, \texttt{overload}, \texttt{spellDamage}]. We created two tasks, which are the categorical features \texttt{cardClass} and \texttt{set}.
  \begin{itemize}
    \item cardClass: We deleted all the \texttt{cardClass} categories that do not have more than 5 test samples. There are 13 unique \texttt{cardClass} attribute values.
    \item set: The \texttt{id} attribute may leak information about the \texttt{set} attribute to be predicted. Therefore, for this task, we changed the \texttt{id} attribute to \texttt{anonymous\_id}. There are 37 unique \texttt{set} attribute values.
  \end{itemize}
\end{itemize}

\subsubsection{League Of Legends Skin}
\begin{itemize}
  \item Category: For League of Legends, we collected data for a total of 1251 champion skins. We kept these columns: [\texttt{id}, \texttt{SkinName}, \texttt{Category}, \texttt{Price}, \texttt{Concept}, \texttt{Model}, \texttt{Particles}, \texttt{Animations}, \texttt{Sounds}, \texttt{Release date}, \texttt{Sold ingame}]. We created one task, which is the categorical feature \texttt{Category}. There are 7 unique \texttt{Category} attribute values.
\end{itemize}

\subsubsection{Counter Strike: Global Offensive}
\begin{itemize}
  \item Quality: For Counter Strike: Global Offensive, we collected data for a total of 956 skins for guns, knives and gloves. We kept these columns: [\texttt{id}, \texttt{Skin Name}, \texttt{Skin Quality}, \texttt{Availability}, \texttt{Skin Category}, \texttt{Min Price}, \texttt{Max Price}, \texttt{Image Path}]. We created one task, which is the categorical feature \texttt{Skin Quality}. There are 6 unique \texttt{Skin Quality} attribute values.
\end{itemize}
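The label clean-up steps described for the Hearthstone tasks above (removing rare classes, filling missing spellSchool values, anonymizing ids) could be sketched as below. This is only an illustration under several assumptions: rows are dicts, a train/test split already exists, a class is kept when it has at least `min_test` test samples, and "changed id to anonymous_id" is read as replacing the original id values with opaque sequential ones. All function names and the `card_<n>` id format are hypothetical.

```python
from collections import Counter


def drop_rare_classes(train_rows, test_rows, label, min_test=5):
    """Remove classes with fewer than min_test test samples from both splits."""
    test_counts = Counter(row[label] for row in test_rows)
    kept = {cls for cls, count in test_counts.items() if count >= min_test}
    return (
        [row for row in train_rows if row[label] in kept],
        [row for row in test_rows if row[label] in kept],
    )


def fill_missing_label(rows, label, placeholder):
    """Assign a placeholder class to rows whose label is missing or empty."""
    return [
        {**row, label: row.get(label) or placeholder}
        for row in rows
    ]


def anonymize_ids(rows, id_column="id", new_column="anonymous_id"):
    """Replace the original id with an opaque sequential identifier.

    Assumption: the leakage comes from the id values themselves, so the
    original id is dropped rather than merely renamed.
    """
    anonymized = []
    for index, row in enumerate(rows):
        new_row = {key: value for key, value in row.items() if key != id_column}
        new_row[new_column] = f"card_{index}"
        anonymized.append(new_row)
    return anonymized
```

Filtering on test-set counts means the split has to be fixed before the class list is finalized, which is why `drop_rare_classes` takes both splits and returns both.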
