sfu-cl-lab / Yeti-Thesis-Project

for my thesis--Yejia Liu

draft data questions #20

Closed oschulte closed 6 years ago

oschulte commented 7 years ago
  1. why are there CSS ranks missing (e.g. in chao_draft.join_skater_and_season_stats_10_years_view we go from 249 to 242)?
  2. why are there 668 players at maximum rank in chao_draft.norm_dataset_for_lmt but only 666 players with CSS rank null in chao_draft.join_skater_and_season_stats_10_years_CSS_null?
chaostewart commented 7 years ago
  1. Not all players with a CSS ranking can be found in the player stats we crawled from nhl.com or eliteprospects.com.
  2. The 666 players with null CSS ranks are set to the maximum rank in their corresponding cohort. So it's 666 + 1 + 1 = 668.
oschulte commented 7 years ago
  1. the missing ranks may be due to missing goalies.
  2. the 668 players may be missing ranked and unranked players
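
A quick way to list exactly which CSS ranks are absent (question 1 above) is sketched below. This is only an illustration: it assumes the ranks of one cohort have been pulled into a plain Python list, that ranks are expected to run consecutively from 1, and that `None` marks a null rank. The function name `missing_ranks` and the sample data are hypothetical, not part of the project code.

```python
def missing_ranks(ranks):
    """Return the ranks absent from 1..max(ranks), ignoring nulls (None)."""
    present = {r for r in ranks if r is not None}
    if not present:
        return []
    return [r for r in range(1, max(present) + 1) if r not in present]


# Hypothetical CSS ranks for one cohort; None marks an unranked player.
css_ranks = [1, 2, 3, 5, 6, None, 9, None]
print(missing_ranks(css_ranks))  # [4, 7, 8]
```

Checking the listed ranks against the original CSS lists would show whether the gaps correspond to goalies or to players missing from the crawled stats.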
oschulte commented 7 years ago

@chaostewart thanks chao! How are you getting 666 + 1 + 1? Are the extra players at the maximum rank?

chaostewart commented 7 years ago

There must be a player in cohort 1 who has the max. CSS rank, so I assigned this max. rank to all players in cohort 1 with no CSS rank. Then I did the same for cohort 2.

chaostewart commented 7 years ago

So yes, the extra players are at the max rank instead of max rank plus 1. After we normalized the rank, I'm not sure if that plus 1 would make much difference.

oschulte commented 7 years ago

Thank you Chao! It won't make a big difference but it's conceptually the right thing. It's interesting that treating "null CSS_rank" as another lower rank works well.
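
The two options discussed above (null rank set to the cohort's max rank, or to max rank plus 1) can be sketched as follows. This is a minimal illustration, not the code that produced chao_draft.norm_dataset_for_lmt: the record layout, the field names `cohort` and `css_rank`, and the function `impute_null_ranks` are all assumptions made for the example.

```python
from collections import defaultdict


def impute_null_ranks(players, offset=1):
    """Fill null css_rank with (max observed rank in the cohort) + offset.

    offset=0 reproduces the "max rank" choice; offset=1 is "max rank plus 1".
    A cohort with no ranked players at all falls back to 0 + offset.
    """
    max_rank = defaultdict(int)
    for p in players:
        if p["css_rank"] is not None:
            max_rank[p["cohort"]] = max(max_rank[p["cohort"]], p["css_rank"])
    return [
        {**p, "css_rank": p["css_rank"] if p["css_rank"] is not None
                          else max_rank[p["cohort"]] + offset}
        for p in players
    ]


# Hypothetical data: two cohorts, three players with a null rank.
players = [
    {"name": "A", "cohort": 1, "css_rank": 1},
    {"name": "B", "cohort": 1, "css_rank": 10},
    {"name": "C", "cohort": 1, "css_rank": None},
    {"name": "D", "cohort": 2, "css_rank": 4},
    {"name": "E", "cohort": 2, "css_rank": None},
    {"name": "F", "cohort": 2, "css_rank": None},
]

at_max = impute_null_ranks(players, offset=0)
print([p["css_rank"] for p in at_max])      # [1, 10, 10, 4, 4, 4]

below_max = impute_null_ranks(players, offset=1)
print([p["css_rank"] for p in below_max])   # [1, 10, 11, 4, 5, 5]
```

With `offset=0`, five players sit at their cohort's max rank: the three null-rank players plus the one genuinely ranked player per cohort, which mirrors the 666 + 1 + 1 count. With `offset=1`, the unranked players form their own lowest rank.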

oschulte commented 7 years ago

@liuyejia : which players are we including - everyone in the draft or just the people who get an NHL contract?

can you please clarify this in the paper and in the github?

liuyejia commented 7 years ago

The data only contains players who got an NHL contract. Yes, I will clarify this in the paper and github datasets.

oschulte commented 6 years ago

nba data https://www.quora.com/What-is-a-good-comprehensive-data-source-for-NCAA-mens-basketball?share=1

oschulte commented 6 years ago

Looks like Chao did a union of values from nhl.com and eliteprospects (e.g. Sidney Crosby has three seasons in the SQL data). But not for plus-minus, e.g. Sidney Crosby's plus-minus is 0 in the SQL data, missing on nhl.com, and 78 on eliteprospects.

oschulte commented 6 years ago

nice new website for hockey data http://corsica.hockey/ . does it contain draft data?

oschulte commented 6 years ago

let's add the major junior league as a feature. There are three. See https://en.wikipedia.org/wiki/List_of_ice_hockey_leagues#Junior.

oschulte commented 6 years ago
  1. which table contains chao's original crawling from draftanalyst? Is it chao_draft.draft_analyst_CSS_rank ?

  2. do null values in chao_draft.join_skater_and_season_stats_10_years come from missing a season type?

  3. What happened to the missing values in draft analyst? Are they replaced by 0?

From chao stewart: Hi Oliver, an example of a “dirty” dataset would be the table “ckm_and_exception_mining.draft_master_table_withRank_masterialized”.

The problems in this table include:
  a. many 2s in ‘cescin_rank’;
  b. the max value of each draft year is used for missing values in ‘cescin_rank’;
  c. many zero values in column ‘Age’. It’s also unclear whether Age is a player’s draft age, the age at the time the data was crawled, or today’s age;
  d. all values are zero in columns ‘Shots’ and ‘shotPercentages’;
  e. most players from draft year 2003 have no CSS_rank, so year 2003 should be left out.

Note: cescin_rank was calculated by multiplying CSS_rank by a specific coefficient given in Schuckers’ paper. In our decision tree project, we should use the original CSS_rank (the final draft rank given by Central Scouting Services). @oschulte

A cleaned version of the same table is close to the view “chao_draft.all_skaters_stats_10_years_view”.
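
Some of the problems listed above (zero-heavy columns such as ‘Age’, all-zero columns such as ‘Shots’) can be detected mechanically. The sketch below assumes the table has been exported to a list of row dictionaries; the helper names `zero_fraction` and `all_zero_columns` and the three sample rows are made up for illustration.

```python
def zero_fraction(rows, column):
    """Fraction of rows whose value in `column` equals 0 (nulls do not count as zero)."""
    if not rows:
        return 0.0
    zeros = sum(1 for row in rows if row.get(column) == 0)
    return zeros / len(rows)


def all_zero_columns(rows, columns):
    """Columns in which every row has the value 0."""
    return [c for c in columns if rows and zero_fraction(rows, c) == 1.0]


# Hypothetical rows; real data would come from the SQL table.
rows = [
    {"Age": 18, "Shots": 0, "shotPercentages": 0},
    {"Age": 0, "Shots": 0, "shotPercentages": 0},
    {"Age": 19, "Shots": 0, "shotPercentages": 0},
]

print(all_zero_columns(rows, ["Age", "Shots", "shotPercentages"]))
print(round(zero_fraction(rows, "Age"), 2))
```

Running a check like this on both the dirty table and the cleaned view would show which of the listed problems the cleaned version actually resolves.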

liuyejia commented 6 years ago
  1. Yes, it is.

  2. We should probably look at the dataset chao_draft.join_skater_and_season_stats_10_years_view, which is the original joined table. Values of predictors like po_plusminus are missing because they are unavailable on eliteprospects.com, while values of sum_7yr_TOI/GP are missing because they are not available on nhl.com.

  3. The missing CSS_rank values are kept as missing until normalization (they are excepted from normalization). They are replaced by 1. See https://github.com/sfu-cl-lab/Yeti-Thesis-Project/blob/master/Decision_Trees/data_normalization/1st_cohort_5_year_norm.csv

PS: chao_draft.all_skaters_stats_10_years_view is the original table for players' performance in the last season before they were drafted.
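
The handling of missing CSS_rank values described in answer 3 can be sketched like this. The thread does not say which normalization was used, so min-max scaling to [0, 1] is an assumption here, as are the function name `normalize_ranks` and the sample ranks. Under that assumption, the worst observed rank maps to 1, so filling nulls with 1 is equivalent to placing unranked players at the max rank.

```python
def normalize_ranks(ranks, fill=1.0):
    """Min-max scale observed ranks to [0, 1]; nulls (None) are skipped, then set to `fill`."""
    observed = [r for r in ranks if r is not None]
    if not observed:
        return [fill for _ in ranks]
    lo, hi = min(observed), max(observed)
    span = hi - lo
    return [
        fill if r is None else ((r - lo) / span if span else 0.0)
        for r in ranks
    ]


# Hypothetical ranks; None marks a player without a CSS rank.
print(normalize_ranks([1, 3, None, 5]))  # [0.0, 0.5, 1.0, 1.0]
```

In the output, the unranked player and the player ranked 5 both end up at 1.0, which is the "max rank" rather than "max rank plus 1" behaviour discussed earlier in the thread.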