nflverse / nflverse-rosters

builds roster data for nflverse/nflverse-data

Why only passer, rusher, or receiver for the new id? #1

Closed vita10gy closed 4 years ago

vita10gy commented 4 years ago

The description says:

"The nflfastR play-by-play data uses special player IDs (UUIDs version 4). The roster data in this repo joins those IDs to the scraped rosters. Therefore only players who appear as passer, rusher, or receiver in the nflfastR play-by-play data will show the pbp_id. This ID can be used to join the roster data to the play-by-play data."

It looks like maybe you're using the jersey number from the play (desc) to link them up.

Wouldn't it be possible to merge the new ids for the other player columns too, such as anyone who commits a penalty (penalty_player_id), kicker_player_id, pass_defense_1_player_id, solo_tackle_1_player_id, and so on?

vita10gy commented 4 years ago

Here's an example if it helps (though it could be that I'm missing something about how this works. Usually I pick up on languages fast enough to get at least a decent sense of what's happening, even if I couldn't possibly write it. To me, R looks like something on the side of an alien craft that crash-landed.)

2020_01_NYJ_BUF Play 810

(5:03) (Shotgun) 17-J.Allen pass incomplete deep right to 15-J.Brown [97-N.Shepherd]. PENALTY on NYJ-35-P.Desir, Defensive Pass Interference, 33 yards, enforced at BUF 18 - No Play.

penalty_player_id => string (36) "32013030-2d30-3033-3133-33316ed0ffee" penalty_player_name => string (7) "P.Desir"

mrcaseb commented 4 years ago

The mapping of roster data does indeed depend on jersey numbers. They are extracted with computationally heavy string parsing using complicated regular expressions. We decided to do this only for passer, rusher and receiver for now, as we haven't had time to test the parser for other ..._player_id columns. We would also have to create a new ..._jersey_number column for each of the ..._player_id columns, which is kind of a mess, so I am not sure whether we will add jersey number columns at all.
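
As a rough illustration of the kind of jersey-number parsing described above, here is a minimal Python sketch run against the play description quoted earlier in this thread. The pattern and the function name `extract_jersey_names` are assumptions made for this example; this is not the parser nflfastR uses, and a real join would presumably also need the team and season, since jersey numbers alone are not unique.

```python
import re

# Hypothetical pattern: a 1-2 digit jersey number, a hyphen, then an
# abbreviated name such as "J.Allen". Not the pattern nflfastR uses.
JERSEY_NAME = re.compile(r"(\d{1,2})-([A-Z][a-z]*\.[A-Z][A-Za-z'-]*)")


def extract_jersey_names(desc):
    """Return (jersey_number, short_name) pairs found in a play description."""
    return [(int(num), name) for num, name in JERSEY_NAME.findall(desc)]


desc = (
    "(5:03) (Shotgun) 17-J.Allen pass incomplete deep right to 15-J.Brown "
    "[97-N.Shepherd]. PENALTY on NYJ-35-P.Desir, Defensive Pass Interference, "
    "33 yards, enforced at BUF 18 - No Play."
)

print(extract_jersey_names(desc))
# [(17, 'J.Allen'), (15, 'J.Brown'), (97, 'N.Shepherd'), (35, 'P.Desir')]
```

Even this toy version hints at why one jersey-number column per ..._player_id column gets messy: the description alone does not say which extracted pair belongs to which role.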

vita10gy commented 4 years ago

Well, I'm biased, but if just the penalty_player_id one were sorted out, it would potentially cycle through all the kinds of players (whereas pass_defense_1 is limited, and things like kicker_player_id may almost as well be done by hand).

You'd probably have most of the olinemen mapped by week 5, a lot of the guys that would end up being found out by pass_defense_1 will have a holding call sooner or later, and so on.

Edit: For the record, I'm completely agnostic on adding penalty_player_jersey_number to the main data or not.

mrcaseb commented 4 years ago

We finally have a decoder for the player IDs and can decode all id columns to the old GSIS IDs.

I have to change the roster code now but this will be solved in the near future.

vita10gy commented 4 years ago

Nice! What did the secret turn out to be?

mrcaseb commented 4 years ago

> Nice! What did the secret turn out to be?

It's a more or less easy hex decoding of a part of the pbp IDs. Someone pointed out to me how to do that and I coded it.

I am going to close this issue as we will bring back fast_scraper_roster() to nflfastR along with a new function decode_player_ids().

vita10gy commented 4 years ago

Will there still be a CSV of the data somewhere?

mrcaseb commented 4 years ago

See https://github.com/mrcaseb/nflfastR-roster/tree/master/data/seasons

vita10gy commented 4 years ago

So, to put a finer point on it, is https://github.com/mrcaseb/nflfastR-roster/blob/master/data/nflfastR-roster.csv.gz no longer getting updates? (Or is there some other file that attempts to marry the new play-by-play ids and the old 00-00xxxxx format?)

mrcaseb commented 4 years ago

I don't plan to keep updating it. The season rosters are available as CSV in data/seasons/ and nflfastR provides a function to decode the new ids. If you are using Python instead, I can send you a link to a Python function for id decoding.

vita10gy commented 4 years ago

I'm actually not using either for processing. I was just using the straight CSVs/JSONs, as they had everything I needed in them, but if that's how it's going to be I guess I can't complain.

I'm using PHP, but if you had that function in Python I could probably figure it out.

mrcaseb commented 4 years ago

The code snippet I used is from here. I've done it in R of course and in a more comfortable way but it's actually this:

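import codecs

def convert_gsis_id(new_id):
    # 32013030-2d30-3032-3334-35395dc60da5   (new pbp id)
    # XXXX3030-2d30-3032-3334-3539XXXXXXXX   (X = characters that are dropped)
    # '00-0023459'                           (decoded GSIS id)
    # Drop the first 4 and last 8 characters, remove the dashes,
    # then decode the remaining hex digits as text.
    return codecs.decode(new_id[4:-8].replace('-',''),"hex").decode('utf-8')

vita10gy commented 4 years ago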

Looks like this did the trick, if anyone else needs it in PHP. I was thinking it actually went the other way, but I see now which way it goes and that it can't be a two-way street.

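function convert_to_gsis_id($new_nfl_id)
{
    // Anything that is not a 36-character id is returned unchanged.
    $return = $new_nfl_id;
    if (strlen($new_nfl_id) == 36)
    {
        $return = '';
        // Remove the dashes, then drop the first 4 and last 8 hex digits.
        $str = substr(str_replace('-', '', $new_nfl_id), 4, -8);
        // Each pair of hex digits is one character of the GSIS id.
        foreach (str_split($str, 2) as $chunk)
        {
            $return .= chr(hexdec($chunk));
        }
    }
    return $return;
}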
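
To make the "can't be a two-way street" point concrete, here is a minimal Python sketch of the same decoding, applied to the two ids quoted in this thread. The function name `decode_gsis_id` and the second id in the last comparison are made up for illustration. Because the first 4 and last 8 hex digits are discarded, different pbp ids can decode to the same GSIS id, so the original id cannot be rebuilt from the GSIS id alone.

```python
def decode_gsis_id(pbp_id):
    """Decode a new-style pbp id to a GSIS id (sketch of the approach above)."""
    hex_digits = pbp_id.replace("-", "")
    if len(hex_digits) != 32:
        # Like the PHP version, leave anything unexpected unchanged.
        return pbp_id
    # Drop the first 4 and last 8 hex digits, decode the rest as text.
    return bytes.fromhex(hex_digits[4:-8]).decode("utf-8")


# The id from the Python snippet's comments:
print(decode_gsis_id("32013030-2d30-3032-3334-35395dc60da5"))  # 00-0023459

# The penalty_player_id for P.Desir from the example play:
print(decode_gsis_id("32013030-2d30-3033-3133-33316ed0ffee"))  # 00-0031331

# A made-up id that differs only in the discarded digits decodes
# to the same GSIS id, which is why the mapping is one-way.
made_up = "ffff3030-2d30-3032-3334-353900000000"
print(decode_gsis_id(made_up) == decode_gsis_id("32013030-2d30-3032-3334-35395dc60da5"))  # True
```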