pylint-dev / pylint

It's not just a linter that annoys you!
https://pylint.readthedocs.io/en/latest/
GNU General Public License v2.0

False positives on ``pandas.io.parsers.TextFileReader`` #4577

Open adam-azarchs opened 3 years ago

adam-azarchs commented 3 years ago

Steps to reproduce

# pylint: disable=missing-module-docstring
# pylint: enable=unsubscriptable-object,unsupported-assignment-operation,no-member
import pandas as pd

data_frame = pd.read_csv("foo.csv")
print(data_frame.shape)
for column in data_frame.columns:
    data_frame[column] = data_frame[column].astype("S")

Current behavior

repro.py:7:14: E1101: Instance of 'TextFileReader' has no 'columns' member (no-member)
repro.py:8:4: E1137: 'data_frame' does not support item assignment (unsupported-assignment-operation)
repro.py:8:25: E1136: Value 'data_frame' is unsubscriptable (unsubscriptable-object)

Strangely, the no-member error goes away if you leave out the print statement.

Expected behavior

No errors, which was the case with pylint 2.7.x.

pylint --version output

Result of pylint --version output:

pylint 2.8.3
astroid 2.5.8
Python 3.7.10 (default, Jun  4 2021, 14:48:32)

Additional dependencies:

pandas==1.2.4

adam-azarchs commented 3 years ago

Actually looking more closely at the pandas code, I think the root cause may be that it's inferring the return type as TextFileReader where actually in this case it should be a DataFrame or Series, since we're not setting iterator or chunksize.
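
To make the suspected root cause concrete, here is a minimal sketch of a function whose return type depends on its arguments. `read_table_like`, `FakeFrame` and `FakeReader` are hypothetical stand-ins, not pandas code; the only detail taken from the comment above is that the reader type is returned when iterator or chunksize is set.

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame: supports item access."""

    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, key):
        return [row[key] for row in self.rows]


class FakeReader:
    """Hypothetical stand-in for TextFileReader: iterable, not subscriptable."""

    def __init__(self, rows, chunksize):
        self.rows = rows
        self.chunksize = chunksize or len(rows) or 1

    def __iter__(self):
        for start in range(0, len(self.rows), self.chunksize):
            yield FakeFrame(self.rows[start:start + self.chunksize])


def read_table_like(rows, iterator=False, chunksize=None):
    # Two return types: which one the caller gets depends on the arguments.
    if iterator or chunksize:
        return FakeReader(rows, chunksize)
    return FakeFrame(rows)


rows = [{"a": 1}, {"a": 2}, {"a": 3}]
eager = read_table_like(rows)               # neither option set: frame
lazy = read_table_like(rows, chunksize=2)   # chunked: reader
print(type(eager).__name__, eager["a"])
print(type(lazy).__name__, [chunk["a"] for chunk in lazy])
```

A tool that infers only the reader branch for a call like `read_table_like(rows)` would report the subscript on `eager` as an error, even though that branch is never taken for those arguments.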

Pierre-Sassoulas commented 3 years ago

Thank you for creating the issue, I can reproduce this. Regarding the error that goes away with the print statement, could it be that there are two no-member errors: one on the print line, Instance of 'TextFileReader' has no 'shape' member (no-member), and the other on the for loop, Instance of 'TextFileReader' has no 'columns' member (no-member)?

adam-azarchs commented 3 years ago

I never see an error for shape. Possibly TextFileReader does have a shape? I haven't looked too deeply into the pandas source code. But that wouldn't explain why columns only complains after that print.

Pierre-Sassoulas commented 3 years ago

I was thinking maybe you did not notice the warning on the print line. I get 4 warnings with your example, one of them on the print line for "shape". Do you mean you get 3 warnings with the example you gave, and the "columns" warning disappears if you remove the print?

adam-azarchs commented 3 years ago

Yes.

Pierre-Sassoulas commented 3 years ago

OK, I cannot reproduce that with pandas 1.2.4 but the main problem is the false positive for no-member anyway.

adam-azarchs commented 3 years ago

In our codebase the main problem is actually the false positive on unsubscriptable-object. The no-member error can be easily worked around by setting generated-members. However as I said I believe the root cause is the same for both - the inferred type TextFileReader is not correct.

anders-kiaer commented 3 years ago

We see the same thing. Strangely enough the following snippet

#pylint: disable=missing-module-docstring,pointless-statement

import pandas as pd

df = pd.read_csv("some.csv")

df.columns
df.columns

gives Instance of 'TextFileReader' has no 'columns' member.

However if either

it goes away. :thinking:

anders-kiaer commented 3 years ago

Investigated a bit further. The false positive appears to have been introduced between astroid==2.5.7 and astroid==2.5.8, in https://github.com/PyCQA/astroid/pull/1009. For the specific snippet above, increasing max_inferred to >=166 removes the false positive.

bersbersbers commented 3 years ago

I am seeing a similar issue with the code:

import io
import pandas

df = pandas.read_csv(io.StringIO("well\nx"))
df.loc[:, "well"] = df.well.str.replace("x", "y")

Instance of 'TextFileReader' has no 'well' member

I believe it is the same root cause, as the error disappears after any one of numerous manipulations. Maybe it is helpful as a test case once this bug is fixed.

spagh-eddie commented 3 years ago

I too have this issue. Interestingly commenting out the df.dropna(...) line removes the problem.

Edit: I have checked, and this is also due to pylint thinking df is a TextFileReader.

# test.py 
import pandas as pd

df = pd.read_csv("input_filename")
df.dropna(subset=["title"], inplace=True)
have_NaN = df[["severity", "priority", "notice"]].isna().any(axis=1)

% # with dropna
% pylint test.py --disable=missing-module-docstring
************* Module test
test.py:5:11: E1136: Value 'df' is unsubscriptable (unsubscriptable-object)

--------------------------------------------------------------------
Your code has been rated at -2.50/10 (previous run: -2.50/10, +0.00)
% # after remove dropna
% pylint test.py --disable=missing-module-docstring

---------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: -2.50/10, +12.50)
% pylint --version
pylint 2.11.1
astroid 2.8.0
Python 3.7.5 (default, Aug 13 2020, 09:55:33) 
[Clang 11.0.3 (clang-1103.0.32.62)]
% python -c 'import pandas; print(pandas.__version__)'
1.3.3

ozyo commented 3 years ago

Huh, I wasn't imagining it :joy: this really was a change in behavior.

When you remove inplace=True and instead assign the changed data frame from dropna to df again everything is fine.

I am using

pylint 2.9.6
astroid 2.6.6
pandas 1.3.2

shartzog commented 2 years ago

EDIT: I just noticed that the error message in https://github.com/PyCQA/pylint/issues/4577#issuecomment-930846829 above is also referencing the variable name, so some of what's below is redundant.

"inplace=True" seems problematic in multiple cases. The following produces a similar issue, but interestingly enough it does NOT reference "TextFileReader" as the unsupported type but rather the variable name itself. Possibly a related but independent issue?

import pandas as pd

data_frame: pd.DataFrame = pd.read_csv("foo.csv")
data_frame.fillna("", inplace=True)
data_frame["bar"] = data_frame[["baz", "bat"]].apply(
    lambda row: f'{str(row["baz"])}-{str(row["bat"])}',
    axis=1
)

Linting the above results in the following (on my system at least ;)):

5,0,error,unsupported-assignment-operation:'data_frame' does not support item assignment
5,27,error,unsubscriptable-object:Value 'data_frame' is unsubscriptable

Replacing the "inplace" operation with data_frame = data_frame.fillna("") eliminates the error.

Version info:

pandas                    1.3.4
astroid                   2.9.0
pylint                    2.12.2

anders-kiaer commented 2 years ago

Based on the findings in https://github.com/PyCQA/pylint/issues/4577#issuecomment-871694490 we use the workaround below which might be useful for others here facing this issue.

Basically, for now we increase astroid.context.InferenceContext.max_inferred to a value higher than the hard-coded 100, using e.g.

[MASTER]

# As a temporary workaround for https://github.com/PyCQA/pylint/issues/4577
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"

in .pylintrc, or alternatively as a direct command line argument

pylint some_file_to_lint.py --init-hook "import astroid; astroid.context.InferenceContext.max_inferred = 500"

brycepg commented 2 years ago

Thank you for the example code @anders-kiaer


With max_inferred = 100
The return types are:
[<Instance of pandas.io.parsers.readers.TextFileReader>, Uninferable]

Only one positive return type


With max_inferred = 500
The return types are:
[<Instance of pandas.io.parsers.readers.TextFileReader>, Uninferable, <Instance of pandas.core.frame.DataFrame>]

Two positive return types


What do you think, @Pierre-Sassoulas @PCManticore: should unsubscriptable-object be raised if a single type is inferred alongside an Uninferable result? I understand that pandas is sacrificing consistent return types for usability with this function; however, it looks like unsubscriptable-object isn't raised if there are multiple positive return types, as with max_inferred=500.

I think a brain could be created for read_csv if we don't want to change unsubscriptable-object.

Raising max_inferred really slows down this code, from 1s to 10s on my computer, so I don't think it's a good idea to change it, @anders-kiaer.
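
The two observations above (flagged when the only concrete candidate is the reader, not flagged once a DataFrame is among the candidates) can be reproduced with a toy decision rule. This is only a sketch under stated assumptions, not pylint's actual check: `UNINFERABLE`, `Reader`, `Frame` and `should_flag_unsubscriptable` are hypothetical names invented for illustration.

```python
UNINFERABLE = object()  # hypothetical sentinel for "inference gave up"


class Reader:
    """Hypothetical type without __getitem__ (plays the TextFileReader role)."""


class Frame:
    """Hypothetical subscriptable type (plays the DataFrame role)."""

    def __getitem__(self, key):
        return key


def should_flag_unsubscriptable(inferred):
    # Ignore unknown results and look only at the concrete candidates.
    concrete = [t for t in inferred if t is not UNINFERABLE]
    if not concrete:
        return False
    # Flag only when no concrete candidate supports subscripting.
    return not any(hasattr(t, "__getitem__") for t in concrete)


capped = [Reader, UNINFERABLE]        # shape of the max_inferred = 100 result
full = [Reader, UNINFERABLE, Frame]   # shape of the max_inferred = 500 result
print(should_flag_unsubscriptable(capped))  # True
print(should_flag_unsubscriptable(full))    # False
```

Under this toy rule, treating an Uninferable entry as "could be anything" and staying silent would be a one-line change: return False whenever the sentinel is present in the list.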


code for printing out types:



import astroid

MAX_INFERRED = 500

# Raise the cap on how many values astroid infers (100 by default).
astroid.context.InferenceContext.max_inferred = MAX_INFERRED

# The "#@" marker selects the node that extract_node returns.
ret = astroid.extract_node("""
import pandas as pd

df = pd.read_csv("some.csv")

df #@
df.columns
df.columns
""")
inferred = ret.inferred()
print(inferred)

FredStober commented 2 years ago

The same problem appears with fillna:

import pandas
dataframe = pandas.DataFrame({'A': [1,2,None]})
dataframe = dataframe.fillna(0)
print(dataframe['A'])

pylint will complain about E1136: Value 'dataframe' is unsubscriptable (unsubscriptable-object)