
Inconsistent Handling of na_values and converters in read_csv #13302

Open gfyoung opened 8 years ago

gfyoung commented 8 years ago

On master (commit 40b4bb4):

>>> data = """A
1
CAT
3"""
>>> f = lambda x: x
>>> read_csv(StringIO(data), na_values='CAT', converters={'A': f}, engine='c')
     A
0    1
1  CAT
2    3
>>> read_csv(StringIO(data), na_values='CAT', converters={'A': f}, engine='python')
     A
0    1
1  NaN
2    3

I expect both to give the same output, and I believe the Python engine's output is the correct one because it respects na_values, unlike the C engine. I thought the simple fix would be to remove the continue statement here, but that causes test failures, so a more involved refactoring is probably needed to align the order of converter application, NaN-value substitution, and dtype conversion.
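For illustration, a minimal sketch of the ordering the expected output implies (hypothetical helper, not pandas internals): check na_values against the raw token first, and only pass surviving tokens to the user converter:

import numpy as np

def convert_column(tokens, converter, na_values):
    # NA substitution happens on the raw token, before the converter runs
    return [np.nan if tok in na_values else converter(tok) for tok in tokens]

print(convert_column(['1', 'CAT', '3'], converter=lambda x: x, na_values={'CAT'}))
# ['1', nan, '3'] -- matches the python engine output above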

IMO this should be added to #12686, as this is a difference in behaviour between the two engines.

xref #5232

gfyoung commented 8 years ago

I have to say, the decision to continue after applying a converter in the C engine is really buggy: it causes a lot of other post-processing to be skipped. Another example (adapted from the parser tests):

>>> """A B
-999 1.200
2 -999.000
3 4.500"""
>>> f = lambda x: x
>>> read_csv(StringIO(data), sep=' ', header=0,
             na_values=[-999.0, -999], converters={'B':f})
     A         B
0  NaN     1.200
1  2.0  -999.000
2  3.0     4.500

Per the original test, that -999.000 should become NaN, but it doesn't: the NA post-processing is skipped entirely because of this innocuous converter.
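To make the control flow concrete, here is a runnable sketch of the pattern being described (assumed helper names, not the actual C parser code):

import numpy as np

def postprocess(columns, converters, na_values):
    out = {}
    for name, tokens in columns.items():
        if name in converters:
            out[name] = [converters[name](t) for t in tokens]
            continue  # <-- the NA masking below is never reached
        out[name] = [np.nan if t in na_values else t for t in tokens]
    return out

print(postprocess({'B': ['1.200', '-999.000', '4.500']},
                  converters={'B': lambda x: x}, na_values={'-999.000'}))
# {'B': ['1.200', '-999.000', '4.500']} -- the sentinel -999.000 survives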

gfyoung commented 7 years ago

xref #15144 (closed):

>>> data = 'A,B,C\n10,-1,10\n10,10,10'
>>> read_csv(StringIO(data), converters={'B': lambda x: x / 10.}, na_values={'B': [-1]})
    A    B   C
0  10 -0.1  10
1  10  1.0  10

Expected:

    A    B   C
0  10  NaN  10
1  10  1.0  10

The current behavior is due to the fact that converters are applied BEFORE na_values are, but consensus appears to be that we should apply na_values first.
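Until the ordering is settled, one workaround consistent with the expected output is to let the parser handle na_values and apply the conversion afterwards (standard pandas API, using the data above):

import pandas as pd
from io import StringIO

data = 'A,B,C\n10,-1,10\n10,10,10'
df = pd.read_csv(StringIO(data), na_values={'B': [-1]})
df['B'] = df['B'].div(10)  # converter applied after NA substitution
print(df)
#     A    B   C
# 0  10  NaN  10
# 1  10  1.0  10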

patricktokeeffe commented 5 years ago

This is an insidious problem with the C engine! In 0.24.2, I observe that implicit conversions also cause the parser to ignore the na_values parameter:

import io
import pandas as pd
data = """\
Time,ConcRT(ug/m3),ConcHR(ug/m3),Flow(lpm),WS(m/s),WD(Deg),AT(C),RH(%),BP(mmHg),FT(C),FRH(%),BV(V),PM,Status
2018-06-27 21:00:00,+000012,+000007,+16.7,00.0,000,+022.6,025,704,+030.1,012,12.5,0,00000
2018-06-27 22:00:00,+000003,+000009,+16.7,00.0,000,+021.2,033,705,+028.6,015,12.5,0,00000
2018-06-27 23:00:00,+099999,+099999,+00.0,00.0,000,+019.6,038,705,+024.0,015,12.5,0,00001

"""

Note these are numeric values, but their representation (sign and zero padding) probably requires an implicit converter to be called.
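The distinction is easy to see outside pandas: the tokens are numerically equal to 99999 but spelled differently as strings:

print(float('+099999') == 99999.0)  # True: numerically equal
print('+099999' == '99999')         # False: the raw tokens differ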

The C engine will not produce NaNs unless the verbatim, character-for-character token is passed to na_values:

pd.read_csv(io.StringIO(data), na_values=['99999'])
  Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12 7 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3 9 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 99999 99999 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=[99999])
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12 7 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3 9 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 99999 99999 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=['+099999'])
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1

This situation highlights the shortcomings of picking a simple 'before' or 'after' ordering for converters. The consensus so far is to apply na_values before converters, but I suspect certain converters should always be applied first.
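A defensive workaround while the engines disagree (plain pandas API, reusing the data string above): read everything as strings, replace the sentinel token explicitly, then convert with to_numeric:

import numpy as np

df = pd.read_csv(io.StringIO(data), dtype=str)  # no implicit conversion
cols = df.columns.drop('Time')
df[cols] = df[cols].replace('+099999', np.nan).apply(pd.to_numeric)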


Note the Python engine handles the situation identically (and "correctly") in all scenarios:

pd.read_csv(io.StringIO(data), na_values=['99999'], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=[99999], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=['+099999'], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
fynnwz commented 11 months ago

@patricktokeeffe @gfyoung I agree, this bug is insidious: very difficult to find and highly confusing. Here is a snippet to track it down; it took me a couple of hours.

import requests
import zipfile
from io import BytesIO
import pandas as pd

zip_url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/solar/historical/10minutenwerte_SOLAR_00704_20020801_20091231_hist.zip"
response = requests.get(zip_url)
response.raise_for_status()

# Open the .zip file in memory
with zipfile.ZipFile(BytesIO(response.content)) as z:
    # Extract the .txt file containing the data
    txt_file_name = z.namelist()[0]
    with z.open(txt_file_name) as file:
        df = pd.read_csv(file, sep=";", na_values="-999", engine="c")

print(df.loc[df['MESS_DATUM'] == 200709010900]['GS_10'])
print(df.loc[df['MESS_DATUM'] == 200709010910]['GS_10'])

This is my output:

262143   -999.0
Name: GS_10, dtype: float64
262144   NaN
Name: GS_10, dtype: float64

Some values in my data are correctly interpreted as NaN and some are not. Apparently the switch happens exactly at the two dates I selected above: before "200709010900", no -999 value is interpreted as NaN; from "200709010910" onwards, all -999 values are correctly interpreted as NaN. I checked the CSV file manually and couldn't find any inconsistencies. However, once I change to engine="python", it all works as expected. Am I missing something, or is this indeed a weird bug?
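For anyone reproducing this, one way to rule out file inconsistencies programmatically (dtype=str avoids all implicit conversion, so the file's exact spelling around the switch is visible):

# Re-read the column as plain strings and inspect the raw tokens
with zipfile.ZipFile(BytesIO(response.content)) as z:
    with z.open(z.namelist()[0]) as file:
        raw = pd.read_csv(file, sep=";", dtype=str)

print(raw.loc[262143:262144, 'GS_10'])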