pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.24k stars 17.79k forks source link

QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

Closed sshu2017 closed 13 hours ago

sshu2017 commented 1 month ago

Research

Link to question on StackOverflow

https://stackoverflow.com/questions/46397526/how-to-use-na-values-option-in-the-pd-read-csv-function

Question about pandas

I have a simple csv file that looks like this:

x,y,z
a,-99,100
b,-99,200
c,-99.0,300
d,-99.0,400

and when I tried a few different na_values, I got different column y back:

import pandas as pd

df1 = pd.read_csv('test.csv', na_values={"y": -99})
print("df1 = \n", df1)

df2 = pd.read_csv('test.csv', na_values={"y": -99.0})
print("\ndf2 = \n", df2)

df3 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]})
print("\ndf3 = \n", df3)

df4 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]})
print("\ndf4 = \n", df4)

Results:

df1 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df2 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df3 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df4 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

I'm not sure if this is a bug or it is by design, so just throwing out a general question here. Thank you!

Pandas version is 2.2.1, just in case needed.

Lokeshram12 commented 1 month ago

the pandas library to read a CSV file where some values like -99 in the column y should be treated as missing values (NaN). However, the approach you are using na_values={"y":-99} might not work as expected because na_values is typically used to specify a list of strings that should be recognized as NaN. To correctly handle this situation in pandas, you can use the na_values parameter with a dictionary where keys are column names and values are lists of values to be treated as NaN.

df = pd.read_csv('test.csv', na_values={"y": [-99]})

The above you already used.Hope you got it

rhshadrach commented 1 month ago

@sshu2017 - thanks for the report. Instead of screenshots, can you edit the OP to have text-based examples? It is extra work for maintainers to try to reproduce behavior using screenshots.

rhshadrach commented 1 month ago

The docstring for na_values states:

Additional strings to recognize as NA/NaN.

You are not providing strings, so this is undefined behavior. Perhaps we should raise if strings are not provided though.

sshu2017 commented 1 month ago

@sshu2017 - thanks for the report. Instead of screenshots, can you edit the OP to have text-based examples? It is extra work for maintainers to try to reproduce behavior using screenshots.

Sorry. I just updated the post.

sshu2017 commented 1 month ago

The docstring for na_values states:

Additional strings to recognize as NA/NaN.

You are not providing strings, so this is undefined behavior. Perhaps we should raise if strings are not provided though.

I see! I tried using strings and things start to look better.

df5 = pd.read_csv('test.csv', na_values={"y": "-99"})
print("\ndf5 = \n", df5)

df6 = pd.read_csv('test.csv', na_values={"y": "-99.0"})
print("\ndf6 = \n", df6)

df7 = pd.read_csv('test.csv', na_values={"y": ["-99", "-99.0"]})
print("\ndf7 = \n", df7)

df8 = pd.read_csv('test.csv', na_values={"y": ["-99.0", "-99"]})

Results:

df5 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df6 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df7 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df8 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

I think df7 and df8 are fine but df5 and df6 are still a little strange - the "-99" took care of both "-99" and "-99.0" in the df5 case, while the "-99.0" only took care of the "-99.0" in the df6 case. But all 4 of them are making more sense now.

Should we add a check so if someone, like me, happen to provide non-string values to the na_values, an exception (or a warning) will be raised? If yes, I am more than happy to submit a PR for it. Thank you @rhshadrach !

rhshadrach commented 1 month ago

Should we add a check so if someone, like me, happen to provide non-string values to the na_values, an exception (or a warning) will be raised?

An exception, I think.

sshu2017 commented 1 month ago

Hi @rhshadrach , seem like this issue has been discussed and dealt with 11 years ago. Maybe we could just cherry pick the commit #3841? Not sure why the changes made in the commit #3841 are not in the latest version, for example, the 3.0.0.dev0+1320.gd093fae3cd version I built locally.

rhshadrach commented 1 month ago

Maybe we could just cherry pick the commit #3841?

I would guess trying to cherry pick a commit from 11 years ago would be problematic.

It seems to me we should be testing for equality here when determining whether to make a replacement. I'm classifying this as a bugfix. Further investigations and PRs to fix are welcome!

sshu2017 commented 3 weeks ago

Hi @rhshadrach, I created a branch and now an ValueError would be raised when user send in a non-string value as na_values. But many tests indicate that non-string values are acceptable, for example: this test, this test, and this test. So I am wondering if this change is a bit too much. Please kindly advise. In case you want to see my code changes, here's the comparison of my branch with the main branch. (sorry it's a little messy because my PyCharm accidentally formatted the file a little bit).

Also I think the issue is in the c parser only and python parser is working as expected, as shown below. (pyarrow parser requires all na_values to be strings so it is all good).

Codes:

import pandas as pd
print("pd version = ", pd.__version__)

df1 = pd.read_csv('test.csv', na_values={"y": -99}, engine="python")
print("df1 = \n", df1)
df2 = pd.df1 = pd.read_csv('test.csv', na_values={"y": -99.0}, engine="python")
print("\ndf2 = \n", df2)
df3 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]}, engine="python")
print("\ndf3 = \n", df3)
df4 = pd.readdf1 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]}, engine="python")
print("\ndf4 = \n", df4)

df5 = pd.read_csv('test.csv', na_values={"y": -99}, engine="c")
print("\ndf5 = \n", df5)
df6 = pd.df1 = pd.read_csv('test.csv', na_values={"y": -99.0}, engine="c")
print("\ndf6 = \n", df6)
df7 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]}, engine="c")
print("\ndf7 = \n", df7)
df8 = pd.readdf1 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]}, engine="c")
print("\ndf8 = \n", df8)

Output:

pd version =  2.2.1
df1 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df2 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df3 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df4 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df5 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df6 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df7 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df8 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

Maybe we could fix the c parser? or make c parse behave like pyarrow parser and only accept strings for na_values?

rhshadrach commented 3 weeks ago

Ah - thanks. It looks like the documentation for it only taking strings was added here: https://github.com/pandas-dev/pandas/commit/20161d950fab02876dd78e67461383b3fe683c51.

Agreed we should not restrict na_values to only being strings.

sshu2017 commented 2 weeks ago

Hi @rhshadrach , sorry but just to confirm - you are suggesting that the c parser should be fixed and it should be able to take in not only

"-99"

but also

-99

Is it correct? If yes, I can start working on it but it may take me a while since I am not so familiar with C.

rhshadrach commented 1 week ago

@sshu2017 - yes, that is correct.