kazooki117 opened this issue 5 years ago
I propose to consolidate all the data sources into a Postgres DB hosted on AWS. We need to think about the schema structure. (We can use a tool like SQLdbm to visualize the data structure and the links between tables.)
My first pass would be the following tables in the schema for deck storage:

- `cards`: card_id, [card_infos] (imported from MTGJSON)
- `deck`: deck_id, session_id [optional, if draft log available], score_game_1 [int], score_game_2 [int], score_game_3 [int], score_total [calculated from score_game_1, 2, and 3]
- `deck_details`: deck_details_id, deck_id, card_id, card_num [int]

For draft data I would say:

- `session`: session_id, session_date [timestamp]
- `draft_log`: draft_id, session_id, pack_num [int], pick_num [int], is_picked [bool] (we would store the list of all the cards seen here and set is_picked to true for the card we take)

All of the tables can be linked on their id columns.
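A quick way to sanity-check the proposal is to write it out as DDL. The sketch below uses sqlite3 so the example is self-contained (the target is Postgres). The exact column types are assumptions, and the `card_id` column on `draft_log` is my addition: a row per card seen needs to say which card it is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cards (card_id INTEGER PRIMARY KEY);  -- plus card_infos from MTGJSON
CREATE TABLE session (session_id INTEGER PRIMARY KEY, session_date TIMESTAMP);
CREATE TABLE deck (
    deck_id INTEGER PRIMARY KEY,
    session_id INTEGER REFERENCES session(session_id),  -- optional
    score_game_1 INTEGER, score_game_2 INTEGER, score_game_3 INTEGER
    -- score_total is derivable from the three game scores; a view or a
    -- generated column avoids storing it redundantly
);
CREATE TABLE deck_details (
    deck_details_id INTEGER PRIMARY KEY,
    deck_id INTEGER NOT NULL REFERENCES deck(deck_id),
    card_id INTEGER NOT NULL REFERENCES cards(card_id),
    card_num INTEGER NOT NULL
);
CREATE TABLE draft_log (
    draft_id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES session(session_id),
    pack_num INTEGER NOT NULL,
    pick_num INTEGER NOT NULL,
    card_id INTEGER NOT NULL REFERENCES cards(card_id),  -- assumed column
    is_picked BOOLEAN NOT NULL
);
""")

# List the tables that were created, to confirm the DDL is valid
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```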
Hey all, Lemon_Tea here from the discord!
What does the score on `deck` represent? 0 for a loss, 1 for a win? It might be better to store win, loss, and tie as explicit values instead of aggregating them and hiding their meaning.
Would it help to have a WUBRG value on the deck data as well?
I'm also confused by `draft_log`: is this a table for the entire draft or just a single pick?
If it's a single pick, it should be represented by a list of cards (given by card id) plus the pick_id of the card that is picked. With the current schema it looks like a sort of N+1 problem, where each record stores a lot more data than you would need.
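To put a rough number on that concern: with one row per card *seen*, a single drafter's log grows quadratically with pack size, while a row-per-pick design stays linear. Back-of-the-envelope arithmetic, using the 15-card, 3-pack numbers from the `draft_log` description above:

```python
# One row per card *seen*: picks 1..15 of a pack show 15, 14, ..., 1 cards.
cards_seen_per_pack = sum(range(1, 16))      # 15 + 14 + ... + 1 = 120
rows_per_drafter = 3 * cards_seen_per_pack   # three 15-card packs
picks_per_drafter = 3 * 15                   # vs. one row per pick

print(rows_per_drafter, picks_per_drafter)
```

That is 360 seen-card rows versus 45 pick rows per drafter; the seen-card rows are what make pick-order analysis possible, so the extra volume may be a deliberate trade-off.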
I think it's important to structure the data based on what the typical queries will look like: what useful questions would we want to be able to ask of the data?
I say this just so due respect is given to how granular one would need to be in order to produce a truly useful database. I've tried doing this myself before (and actually got somewhat far, see here: https://github.com/sjb9774/MTGDB-API), but ultimately I abandoned it as overly ambitious.
Instead of importing all the card data from MTGJSON as suggested by @ValentinPerret, I might suggest just storing the multiverse id (or, if there's a better unique identifier, use that) and expecting users of the database to use the Scryfall API (or some other) to pull detailed card info in their own applications, leaving this database to be just a repository of well-formed draft logs. That way a lot of the processing is offloaded to the individual users (which is unfortunate) but I think it is ultimately smarter.
Just added a branch that has a basic data structure and imports MTGO draft logs here: https://github.com/kazooki117/mtg-data-analysis/tree/importer/importer. There is definitely some room for improvement in the data structure. It also currently has no support for recording deck performance, nor for incorporating sealed decks.
@nebanche: `draft_log` would store the entire draft, with one line per card. So pack 1 pick 1 would have 15 cards, then P1P2 14 cards, etc. This structure lets us store all the information about the draft (and will allow analysis of pick orders, and of which cards were picked over which others, in the future).

For the deck score, the mapping would be:

- (2-0) --> 2
- (2-1) --> 1
- (1-1) --> 0
- (1-2) --> -1
- (0-2) --> -2

(issue: I miss 1-0 and 0-1... :s)

The WUBRG value is redundant with the list of cards. I feel like we could add this calculation in a view (see comment below) rather than in the base table.

@sjb9774: we could build Postgres views on top of the base tables that present the data so it can be queried easily for the specific use cases you described.

https://github.com/kazooki117/mtg-data-analysis/pull/14 updates the schema as follows:
`expansions`:
- `abbreviation`: String(3), nullable=False, primary_key=True
- `name`: String(255), nullable=False
- `max_booster_number`: Integer, nullable=False

`cards`:
- `id`: Integer, primary_key=True
- `multiverse_id`: Integer, nullable=False
- `expansion`: String(3), ForeignKey('expansions.abbreviation'), nullable=False
- `name`: String(255), nullable=False
- `rarity`: String(255), nullable=False
- `number`: Integer, nullable=False
- `face`: String(255)
- `mana_cost`: String(255)
- `text`: String(1023)
- `type`: String(255), nullable=False
- `power`: Integer
- `toughness`: Integer
- `loyalty`: Integer

`users`:
- `id`: Integer, primary_key=True
- `username`: String(255), nullable=False

`drafts`:
- `id`: Integer, primary_key=True
- `user`: Integer, ForeignKey('users.id'), nullable=False
- `name`: String(255), nullable=False
- `start_time`: DateTime, nullable=True

`packs`:
- `id`: Integer, primary_key=True
- `draft`: Integer, ForeignKey('drafts.id'), nullable=False
- `pick_number`: Integer, nullable=False

`pack_cards`:
- `id`: Integer, primary_key=True
- `pack`: Integer, ForeignKey('packs.id'), nullable=False
- `card_multiverse_id`: Integer, nullable=False

`picks`:
- `id`: Integer, primary_key=True
- `pack_card`: Integer, ForeignKey('pack_cards.id'), nullable=False
Querying all the picks for a draft looks like:
```sql
SELECT
  packs.pick_number,
  string_agg(cards.name, '/' ORDER BY cards.face) AS pick
FROM packs
JOIN pack_cards ON (packs.id = pack_cards.pack)
JOIN cards ON (pack_cards.card_multiverse_id = cards.multiverse_id)
JOIN picks ON (picks.pack_card = pack_cards.id)
WHERE packs.draft = 2
GROUP BY packs.pick_number
ORDER BY packs.pick_number
```
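As a sanity check, this schema and query can be exercised end-to-end in memory. The sketch below substitutes sqlite3 for Postgres (sqlite's `group_concat` stands in for the string-aggregate function), trims `drafts` to the columns the query touches, and uses made-up card data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cards (id INTEGER PRIMARY KEY, multiverse_id INTEGER NOT NULL,
                    name TEXT NOT NULL, face TEXT);
CREATE TABLE drafts (id INTEGER PRIMARY KEY);
CREATE TABLE packs (id INTEGER PRIMARY KEY,
                    draft INTEGER NOT NULL REFERENCES drafts(id),
                    pick_number INTEGER NOT NULL);
CREATE TABLE pack_cards (id INTEGER PRIMARY KEY,
                         pack INTEGER NOT NULL REFERENCES packs(id),
                         card_multiverse_id INTEGER NOT NULL);
CREATE TABLE picks (id INTEGER PRIMARY KEY,
                    pack_card INTEGER NOT NULL REFERENCES pack_cards(id));

-- made-up data: draft 2, two picks; card names are placeholders
INSERT INTO cards VALUES (1, 100, 'Shock', NULL), (2, 200, 'Opt', NULL),
                         (3, 300, 'Duress', NULL);
INSERT INTO drafts VALUES (2);
INSERT INTO packs VALUES (1, 2, 1), (2, 2, 2);
INSERT INTO pack_cards VALUES (1, 1, 100), (2, 1, 200),  -- pick 1 saw two cards
                              (3, 2, 200), (4, 2, 300);  -- pick 2 saw two cards
INSERT INTO picks VALUES (1, 1), (2, 4);                 -- took Shock, then Duress
""")

rows = conn.execute("""
SELECT packs.pick_number,
       group_concat(cards.name, '/') AS pick
FROM packs
JOIN pack_cards ON packs.id = pack_cards.pack
JOIN cards ON pack_cards.card_multiverse_id = cards.multiverse_id
JOIN picks ON picks.pack_card = pack_cards.id
WHERE packs.draft = 2
GROUP BY packs.pick_number
ORDER BY packs.pick_number
""").fetchall()
print(rows)  # [(1, 'Shock'), (2, 'Duress')]
```

The join on `picks` means only the chosen card per pick number comes back; dropping that join would return everything seen in each pack instead.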
This leaves only the decks and records still needing schemas.
Context
We want to create a consistent DB for all the MTG draft data gathered around. For that we will set up a database hosted in the cloud that everyone can access with the proper credentials.
Objective
As a first step we need to determine the structure of our schema. The purpose of this ticket is to discuss the structure of this database (the tables and the relations between them).