lujiaying / MUG-Bench

Data and code of the Findings of EMNLP'23 paper MuG: A Multimodal Classification Benchmark on Game Data with Tabular, Textual, and Visual Fields
https://aclanthology.org/2023.findings-emnlp.354/

Data Sources & Dataset Sharing #3

Closed lujiaying closed 1 year ago

lujiaying commented 2 years ago

From #1,

Possible options for sharing Data Set after submission

lujiaying commented 1 year ago

Dataset Creation Process

For Pokemon, we collected data for 897 Pokémon from https://bulbapedia.bulbagarden.net/wiki/Main_Page. We deleted the japanese_name column and all attributes related to "against" to create the classification tasks for type_1 and type_2.
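The column-dropping step above can be sketched in plain Python. This is only an illustration, not the code used for the benchmark: the function name `prepare_pokemon_rows` and the toy rows are hypothetical, and treating "related to against" as "column name starts with `against`" is an assumption. The `None_Type` fill for single-type Pokémon follows the LaTeX draft later in this thread.

```python
def prepare_pokemon_rows(rows):
    """Drop japanese_name and 'against*' columns; fill a missing type_2."""
    prepared = []
    for row in rows:
        # Assumption: "related to against" means the column name starts with "against".
        kept = {
            key: value
            for key, value in row.items()
            if key != "japanese_name" and not key.startswith("against")
        }
        # Single-type Pokémon get an explicit placeholder class.
        if not kept.get("type_2"):
            kept["type_2"] = "None_Type"
        prepared.append(kept)
    return prepared


# Toy rows for illustration only; the real table has many more columns.
rows = [
    {"name": "Bulbasaur", "japanese_name": "Fushigidane",
     "against_fire": 2.0, "type_1": "grass", "type_2": "poison"},
    {"name": "Charmander", "japanese_name": "Hitokage",
     "against_fire": 0.5, "type_1": "fire", "type_2": None},
]
cleaned = prepare_pokemon_rows(rows)
```

The same cleaned rows can serve both tasks: `type_1` is the label for one, `type_2` for the other.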

For Hearthstone, we collected data for 10710 cards from https://hearthstonejson.com/. We split this dataset into 3 categories: hearthstone_minions, hearthstone_spells, and hearthstone_all.
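A minimal sketch of the three-way split, assuming the cards are already loaded as a list of dicts with a `type` field. The function name `split_hearthstone`, the toy cards, and the case-insensitive comparison are assumptions made for illustration. The membership of hearthstone_all (minions, spells, locations, weapons) follows the LaTeX draft later in this thread.

```python
def split_hearthstone(cards):
    """Split card dicts into the three subsets by their 'type' field."""
    def card_type(card):
        # Assumption: compare case-insensitively, since the casing of 'type' may vary.
        return (card.get("type") or "").lower()

    all_types = {"minion", "spell", "location", "weapon"}
    return {
        "hearthstone_minions": [c for c in cards if card_type(c) == "minion"],
        "hearthstone_spells": [c for c in cards if card_type(c) == "spell"],
        "hearthstone_all": [c for c in cards if card_type(c) in all_types],
    }


# Hypothetical toy cards; ids and names are made up.
cards = [
    {"id": "T_001", "name": "Toy Minion", "type": "MINION"},
    {"id": "T_002", "name": "Toy Spell", "type": "SPELL"},
    {"id": "T_003", "name": "Toy Weapon", "type": "WEAPON"},
    {"id": "T_004", "name": "Toy Hero", "type": "HERO"},
]
subsets = split_hearthstone(cards)
```

In this toy input the hero card falls outside all three subsets, because only the four listed types are combined into hearthstone_all.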

For Lol-skin, we collected data for 1251 skins from http://ddragon.leagueoflegends.com/cdn/12.19.1/data/en_US/champion.json and https://lolskinshop.com/product-category/lol-skins/. We kept the columns ["SkinName", "Category", "Price", "Concept", "Model", "Particles", "Animations", "Sounds", "Release date", "Sold ingame"]. We created one task based on this dataset: category.

lujiaying commented 1 year ago

\subsubsection{\pkm}
\begin{itemize}
  \item Primary Type: There are 18 unique type\_1 values. We dropped the japanese\_name column and all attributes related to ``against'' to create the categorical classification task for type\_1.
  \item Secondary Type: There are 18 unique type\_2 values. As for the Primary Type task, we dropped the japanese\_name column and all attributes related to ``against'' to create the categorical classification task for type\_2. Because some Pokémon have only one type, their type\_2 class is set to None\_Type.
\end{itemize}

\subsubsection{Hearthstone}
For Hearthstone, we collected data for 10710 cards from https://hearthstonejson.com/. We split this dataset into 3 categories: Minions, Spells, and All.
\begin{itemize}
  \item Minions-race: For Hearthstone Minions, we filtered the dataset by selecting ‘Minion’ in the ‘type’ attribute, which yields 6776 minions. We kept these columns: [‘cardClass’, ‘health’, ‘id’, ‘name’, ‘set’, ‘attack’, ‘cost’, ‘rarity’, ‘artist’, ‘collectible’, ‘text’, ‘mechanics’, ‘race’, ‘Image Path’]. We created one task, predicting the categorical feature ‘race’. We deleted all ‘race’ categories that have fewer than 5 test examples. There are 15 unique ‘race’ values.
  \item Spells-spellSchool: For Hearthstone Spells, we filtered the dataset by selecting ‘Spell’ in the ‘type’ attribute, which yields 3993 spells. We kept these columns: [‘cardClass’, ‘health’, ‘id’, ‘name’, ‘set’, ‘cost’, ‘rarity’, ‘artist’, ‘collectible’, ‘spellSchool’, ‘text’, ‘mechanics’, ‘Image Path’]. We created one task, predicting the categorical feature ‘spellSchool’. Spells without a ‘spellSchool’ value were assigned NONE\_spellSchool. There are 8 unique ‘spellSchool’ values.
  \item All: For Hearthstone All, we combined all minions and spells, as well as locations and weapons. We kept all the columns from the two tasks above plus [‘durability’, ‘overload’, ‘spellDamage’]. We created two tasks, predicting the categorical features ‘cardClass’ and ‘set’.
  \begin{itemize}
    \item cardClass: We deleted all ‘cardClass’ categories that do not have more than 5 test examples. There are 13 unique ‘cardClass’ values.
    \item set: The ‘id’ attribute may leak information about the ‘set’ attribute. Therefore, for this task, we changed the ‘id’ attribute to ‘anonymous\_id’. There are 37 unique ‘set’ values.
  \end{itemize}
\end{itemize}
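The three Hearthstone cleaning steps described above (filling missing ‘spellSchool’ values, dropping classes with too few test examples, and anonymizing ‘id’ for the ‘set’ task) could look roughly like the sketch below. All function names and toy rows are hypothetical. How the train/test split is made and how ‘anonymous_id’ values are generated are not stated in this thread, so the split is passed in as an argument and a running index is used as a stand-in.

```python
from collections import Counter


def fill_missing_spell_school(spells):
    """Assign NONE_spellSchool to spells that have no spellSchool value."""
    return [
        {**spell, "spellSchool": spell.get("spellSchool") or "NONE_spellSchool"}
        for spell in spells
    ]


def drop_rare_classes(train_rows, test_rows, label, min_test=5):
    """Keep only classes with at least `min_test` examples in the test split."""
    test_counts = Counter(row[label] for row in test_rows)
    keep = {cls for cls, count in test_counts.items() if count >= min_test}
    return (
        [row for row in train_rows if row[label] in keep],
        [row for row in test_rows if row[label] in keep],
    )


def anonymize_ids(rows):
    """Replace 'id' with 'anonymous_id'. Assumption: a running index is used."""
    anonymized = []
    for index, row in enumerate(rows):
        new_row = {key: value for key, value in row.items() if key != "id"}
        new_row["anonymous_id"] = index
        anonymized.append(new_row)
    return anonymized


# Toy data for illustration only.
spells = fill_missing_spell_school([{"id": "A", "spellSchool": "FIRE"}, {"id": "B"}])
train = [{"race": "BEAST"}] * 3 + [{"race": "TOTEM"}]
test = [{"race": "BEAST"}] * 5 + [{"race": "TOTEM"}] * 2
train_kept, test_kept = drop_rare_classes(train, test, "race")
anonymized = anonymize_ids([{"id": "XYZ_001", "set": "CORE"}])
```

The text gives slightly different thresholds for ‘race’ (fewer than 5) and ‘cardClass’ (not more than 5), so `min_test` is left as a parameter: `min_test=5` matches the first wording and `min_test=6` the second.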

\subsubsection{League Of Legends Skin}
\begin{itemize}
  \item Category: For League of Legends, we collected data for 1251 champion skins. We kept these columns: [‘id’, ‘SkinName’, ‘Category’, ‘Price’, ‘Concept’, ‘Model’, ‘Particles’, ‘Animations’, ‘Sounds’, ‘Release date’, ‘Sold ingame?’]. We created one task, predicting the categorical feature ‘Category’. There are 7 unique ‘Category’ values.
\end{itemize}

\subsubsection{Counter Strike: Global Offensive}
\begin{itemize}
  \item Quality: For Counter Strike: Global Offensive, we collected data for 956 skins for guns, knives, and gloves. We kept these columns: [‘id’, ‘Skin Name’, ‘Skin Quality’, ‘Availability’, ‘Skin Category’, ‘Min Price’, ‘Max Price’, ‘Image Path’]. We created one task, predicting the categorical feature ‘Skin Quality’. There are 6 unique ‘Skin Quality’ values.
\end{itemize}