Closed by oschulte 6 years ago
@chaostewart thanks chao! How are you getting 666 + 1 + 1? Are the extra players at the maximum rank?
There must be a player in cohort 1 who has the max CSS rank, so I assigned this max rank to all players in cohort 1 who have no CSS rank. Then I did the same for cohort 2.
So yes, the extra players are at the max rank instead of max rank plus 1. Once the rank is normalized, I'm not sure that plus 1 would make much difference.
Thank you Chao! It won't make a big difference but it's conceptually the right thing. It's interesting that treating "null CSS_rank" as another lower rank works well.
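A minimal sketch of the per-cohort imputation discussed above, in plain Python. The record layout (`cohort`, `css_rank`), the function name, and the sample players are made up for illustration; they are not the actual schema or data. With `offset=1` missing ranks get max rank plus 1, with `offset=0` they get the max rank itself, which is what was done so far.

```python
from collections import defaultdict

def impute_missing_css_rank(players, offset=1):
    """Fill missing CSS ranks with (max rank in the same cohort + offset).

    Hypothetical helper: 'cohort' and 'css_rank' are assumed field names.
    """
    # First pass: find the largest observed rank in each cohort.
    max_rank = defaultdict(int)
    for p in players:
        if p["css_rank"] is not None:
            max_rank[p["cohort"]] = max(max_rank[p["cohort"]], p["css_rank"])

    # Second pass: copy each record, filling in missing ranks.
    filled = []
    for p in players:
        rank = p["css_rank"]
        if rank is None:
            rank = max_rank[p["cohort"]] + offset
        filled.append({**p, "css_rank": rank})
    return filled

# Made-up example records, not real players.
players = [
    {"name": "A", "cohort": 1, "css_rank": 1},
    {"name": "B", "cohort": 1, "css_rank": 210},
    {"name": "C", "cohort": 1, "css_rank": None},
    {"name": "D", "cohort": 2, "css_rank": 180},
    {"name": "E", "cohort": 2, "css_rank": None},
]
filled_plus_one = impute_missing_css_rank(players, offset=1)
filled_at_max = impute_missing_css_rank(players, offset=0)
```

Keeping the offset as a parameter makes it easy to compare the two variants after normalization.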
@liuyejia : which players are we including - everyone in the draft or just the people who get an NHL contract?
Can you please clarify this in the paper and on GitHub?
The data only contains players who got an NHL contract. Yes, I will clarify this in the paper and in the GitHub datasets.
It looks like Chao took a union of values from nhl.com and eliteprospects (e.g. Sidney Crosby has three seasons in the SQL data), but not for plus-minus: Sidney Crosby has a plus-minus of 0 in the SQL data, a missing value on nhl.com, and 78 on eliteprospects.
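One way to extend the union to plus-minus would be a per-field coalesce that takes the first non-missing value in source order. This is only a sketch: the dict layout, the field name `plus_minus`, and the source priority (nhl.com first, then eliteprospects) are assumptions, not how the SQL data is actually built. It also assumes a missing value is stored as `None`; a stored 0 like the one in the SQL data would count as a real value here, so placeholder zeros would need to be turned into `None` first.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are missing."""
    for v in values:
        if v is not None:
            return v
    return None

def merge_sources(primary, secondary):
    """Merge two stat dicts field by field, preferring the primary source."""
    fields = set(primary) | set(secondary)
    return {f: coalesce(primary.get(f), secondary.get(f)) for f in fields}

# Plus-minus values as described in the thread: missing on nhl.com, 78 on eliteprospects.
nhl_row = {"plus_minus": None}
ep_row = {"plus_minus": 78}
merged = merge_sources(nhl_row, ep_row)
```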
We should include the league as a source of variability. Update: we can do this by propagating information from Eliteprospects to the right views. Which are the right views? There should be three kinds:
Nice new website for hockey data: http://corsica.hockey/ . Does it contain draft data?
Let's add the major junior league as a feature. There are three. See https://en.wikipedia.org/wiki/List_of_ice_hockey_leagues#Junior.
Which table contains Chao's original crawl from draftanalyst? Is it chao_draft.draft_analyst_CSS_rank?
Do the null values in chao_draft.join_skater_and_season_stats_10_years come from a missing season type?
What happened to the missing values in draft analyst? Are they replaced by 0?
From chao stewart: Hi Oliver, an example of a “dirty” dataset would be table “ckm_and_exception_mining.draft_master_table_withRank_masterialized”.
The problems in this table include:
a. many 2s in ‘cescin_rank’;
b. the max value of each draft year is used for missing values in ‘cescin_rank’;
c. many zero values in column ‘Age’. It’s also unclear whether Age is a player’s draft age, their age at the time the data was crawled, or their age today;
d. all values are zero in columns ‘Shots’ and ‘shotPercentages’;
e. most players from draft year 2003 have no CSS_rank, so year 2003 should be left out.
Note: cescin_rank was calculated by multiplying CSS_rank by a specific coefficient given in Schucker’s paper. In our decision tree project, we should use the original CSS_rank (the final draft rank given by Central Scouting Services). @oschulte
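Problems like c, d and e above can be flagged with a couple of simple checks before a table is used. The sketch below works on a list of dicts; the column name `DraftYear` and the four sample rows are invented for illustration, and missing ranks are assumed to be stored as `None`.

```python
def zero_fraction(rows, column):
    """Share of rows whose value in `column` is exactly 0."""
    values = [r[column] for r in rows]
    return sum(1 for v in values if v == 0) / len(values)

def missing_rank_share_by_year(rows, year_col="DraftYear", rank_col="CSS_rank"):
    """For each draft year, the share of players with no CSS rank."""
    totals, missing = {}, {}
    for r in rows:
        year = r[year_col]
        totals[year] = totals.get(year, 0) + 1
        if r[rank_col] is None:
            missing[year] = missing.get(year, 0) + 1
    return {year: missing.get(year, 0) / totals[year] for year in totals}

# Invented rows, only to show the shape of the checks.
rows = [
    {"DraftYear": 2003, "CSS_rank": None, "Age": 0, "Shots": 0},
    {"DraftYear": 2003, "CSS_rank": None, "Age": 18, "Shots": 0},
    {"DraftYear": 2004, "CSS_rank": 12, "Age": 18, "Shots": 0},
    {"DraftYear": 2004, "CSS_rank": None, "Age": 0, "Shots": 0},
]
shots_zero = zero_fraction(rows, "Shots")    # 1.0 means the column is all zeros
age_zero = zero_fraction(rows, "Age")
missing_by_year = missing_rank_share_by_year(rows)
```

A column with a zero fraction of 1.0 carries no information, and a draft year with a missing-rank share close to 1.0 is a candidate for being left out.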
A cleaned version of the same table is close to the view “chao_draft.all_skaters_stats_10_years_view”.
Yes, it is.
We should probably look at the dataset chao_draft.join_skater_and_season_stats_10_years_view, which is the original joined table. Predictors like po_plusminus have missing values because they are unavailable on eliteprospects.com, while sum_7yr_TOI/GP has missing values because they are not available on nhl.com.
The missing CSS_rank values are kept as missing until normalization. They are excepted from normalization and then replaced by 1. See https://github.com/sfu-cl-lab/Yeti-Thesis-Project/blob/master/Decision_Trees/data_normalization/1st_cohort_5_year_norm.csv
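A rough sketch of that order of operations: normalize only the observed ranks, then fill the missing ones with 1. The thread does not say which normalization was used, so min-max scaling to [0, 1] is an assumption here, as are the function name and the sample ranks. Under min-max scaling, a fill value of 1 puts unranked players level with the worst observed rank.

```python
def normalize_ranks(ranks, fill=1.0):
    """Min-max scale observed ranks to [0, 1]; missing ranks (None) become `fill`.

    Missing values are left out when computing min and max, so they do not
    influence the scale of the observed ranks.
    """
    present = [r for r in ranks if r is not None]
    if not present:
        # Nothing observed: every entry is missing.
        return [fill] * len(ranks)
    lo, hi = min(present), max(present)
    span = hi - lo
    normalized = []
    for r in ranks:
        if r is None:
            normalized.append(fill)
        elif span == 0:
            # All observed ranks are identical.
            normalized.append(0.0)
        else:
            normalized.append((r - lo) / span)
    return normalized

# Made-up ranks: best rank 1, worst observed rank 201, one unranked player.
scaled = normalize_ranks([1, 101, None, 201])
```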
PS: chao_draft.all_skaters_stats_10_years_view is the original table for players' performance in the last season before they were drafted.