nflverse / nflverse-data

Automated nflverse data repository
https://www.nflverse.com
Creative Commons Attribution 4.0 International

[BUG] unwanted line feeds within "injuries_YYYY.csv" #33

Closed by henninghe 9 months ago

henninghe commented 9 months ago

Is there an existing issue for this?

Have you installed the latest development version of the package(s) in question?

What version of the package do you have?

not relevant, I just want to use the downloaded csv

Describe the bug

Within the csv files for injuries there are unwanted line feeds in column "practice_status" for line items where this field is empty. This leads to issues when loading the file e.g. with pandas.read_csv.

Example file: https://github.com/nflverse/nflverse-data/releases/tag/injuries/injuries_2023.csv, line item 26 ("Robert Tonyan")

Reprex

not relevant, I just want to use the downloaded csv

Expected Behavior

I would expect that there are no line feeds within "cells" of a csv file whatsoever.

nflverse_sitrep

not relevant, I just want to use the downloaded csv

Screenshots

No response

Additional context

No response

john-b-edwards commented 9 months ago

I have no issues opening the csv in a typical file reader or reading this specific file into pandas.

import pandas as pd
injuries = pd.read_csv('https://github.com/nflverse/nflverse-data/releases/download/injuries/injuries_2023.csv')
injuries.iloc[24]
#> season                                       2023
#> game_type                                     REG
#> team                                          CHI
#> week                                            1
#> gsis_id                                00-0033757
#> position                                       TE
#> full_name                           Robert Tonyan
#> first_name                                 Robert
#> last_name                                  Tonyan
#> report_primary_injury                        Back
#> report_secondary_injury                       NaN
#> report_status                        Questionable
#> practice_primary_injury                       NaN
#> practice_secondary_injury                     NaN
#> practice_status                            \n    
#> date_modified                2023-09-09T21:52:37Z
#> Name: 24, dtype: object

I see the new line character, but do not see why it would cause an issue loading the csv into python. Can you please give me a reproducible example illustrating your issue with loading this file into python? You can use the reprexpy package in python to create reprexes.
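For context on why the new line character is harmless: a line feed inside a quoted CSV field is part of that field's value, so a CSV parser keeps the row together. A minimal sketch with Python's standard csv module follows. The sample text is hypothetical (it is not the real file, and "Other Player" is a made-up row), and it assumes the empty practice_status cells are quoted in the file, which is consistent with pandas reading them as a single value above.

```python
import csv
import io

# Hypothetical sample, not the real injuries file. The practice_status
# cell of the first data row is quoted and contains a line feed followed
# by four spaces, mirroring the value shown in the pandas output above.
sample = (
    'full_name,practice_status,date_modified\n'
    '"Robert Tonyan","\n    ",2023-09-09T21:52:37Z\n'
    '"Other Player","Full",2023-09-09T21:52:37Z\n'
)

# The reader treats the quoted line feed as data, not as a row separator.
rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]

print(len(sample.splitlines()), "physical lines in the text")
print(len(records), "data rows after parsing")
print(repr(records[0][1]))
```

The text spans more physical lines than it has records, which is what a plain text editor shows, but the parsed result still has one row per player.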

henninghe commented 9 months ago

Hi John, thanks for the fast reply. I made a mistake at my end. I used the wrong file path 'https://github.com/nflverse/nflverse-data/releases/tag/injuries/injuries_2023.csv' instead of 'https://github.com/nflverse/nflverse-data/releases/download/injuries/injuries_2023.csv', which led to a "ParserError: Error tokenizing data." when trying to read the file with pandas.

I then downloaded the file manually and checked for potential issues and stumbled across the extra line feeds. After fixing them, I was able to load the file and assumed this was the root cause. Would have been a smart move to check reading the local file without any fixes first.
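A minimal sketch of how this kind of mix-up could be caught early. Both helper names are hypothetical and not part of any nflverse package. The URL pattern is taken from the working download link in this thread, and the HTML check assumes the wrong path served a web page rather than CSV, which would explain the tokenizing error but was not verified here.

```python
def release_asset_url(tag: str, filename: str) -> str:
    """Build a direct download URL for a release asset (hypothetical helper).

    The pattern copies the working link from this thread; note the
    'releases/download/' segment instead of 'releases/tag/'.
    """
    return (
        "https://github.com/nflverse/nflverse-data/"
        f"releases/download/{tag}/{filename}"
    )


def looks_like_html(text: str) -> bool:
    """Rough check for a web page returned where CSV was expected."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


url = release_asset_url("injuries", "injuries_2023.csv")
print(url)

# Hypothetical response bodies, for illustration only.
print(looks_like_html("<!DOCTYPE html>\n<html><head></head></html>"))
print(looks_like_html("season,game_type,team\n2023,REG,CHI\n"))
```

Running a check like looks_like_html on the downloaded text before handing it to pandas.read_csv would point at the URL rather than at the file contents.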

Sorry for taking up your time on this and thanks for the response. Your example helped me to figure out the real issue at my end.

I love your great efforts on this project! Regards, Henning