pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Wrong Custom Formatters applied when displaying truncated frames #35410

Open ipcoder opened 4 years ago

ipcoder commented 4 years ago

Problem description

I am providing custom formatters for specific columns as a dict. If the frame is large enough that some columns are truncated, the wrong formatters are applied to the columns. (In my case this leads to crashes, because the formatter receives the wrong data type.)

Note that the behavior changes with the width of the console window, since different columns are displayed.

Problem investigation

I examined the code of my version of pandas (1.0.5) and compared it with the latest version on GitHub; the bug still seems to be there.

The problem starts in this method (DataFrameFormatter._to_str_columns): frame is set to the truncated frame (frame = self.tr_frame), and self._format_col(i) is then called with the position of the column in the TRUNCATED frame:

    def _to_str_columns(self) -> List[List[str]]:
        """
        Render a DataFrame to a list of columns (as lists of strings).
        """
        # this method is not used by to_html where self.col_space
        # could be a string so safe to cast
        self.col_space = cast(int, self.col_space)

        frame = self.tr_frame
        # may include levels names also

        str_index = self._get_formatted_index(frame)

        if not is_list_like(self.header) and not self.header:
            stringified = []
            for i, c in enumerate(frame):
                fmt_values = self._format_col(i)

Then this "truncated" column index is passed to self._get_formatter:

    def _format_col(self, i: int) -> List[str]:
        frame = self.tr_frame
        formatter = self._get_formatter(i)   # the problem is HERE? _get_formatter(frame.columns[i]) ?

which looks up the formatter in the columns of the full frame, even though index i refers to a position in the columns of the truncated frame:

           # ...
        else:
            if is_integer(i) and i not in self.columns:
                i = self.columns[i]
            return self.formatters.get(i, None)
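The mismatch described above can be reproduced without pandas. The following is a minimal sketch, not the pandas source: the column names, the truncated column list, and both helper functions are hypothetical stand-ins for `self.columns`, `self.tr_frame.columns`, and `_get_formatter`.

```python
# Simplified model of the lookup; none of these names come from pandas itself.
full_columns = [f"Col{n}" for n in range(6)]
# Assumed truncation result: the middle columns are dropped from the display.
truncated_columns = ["Col0", "Col1", "Col4", "Col5"]

# One formatter per column label, keyed by label (the dict case from the report).
formatters = {
    name: (lambda x, name=name: f"{name}: {x}") for name in full_columns
}


def get_formatter_buggy(i):
    # Position i comes from the truncated frame but is resolved
    # against the FULL column list, as in the reported behavior.
    return formatters.get(full_columns[i])


def get_formatter_by_label(i):
    # Position i is first translated to a label of the truncated frame.
    return formatters.get(truncated_columns[i])


positions = range(len(truncated_columns))
buggy = [get_formatter_buggy(i)(0) for i in positions]
by_label = [get_formatter_by_label(i)(0) for i in positions]

print(buggy)     # ['Col0: 0', 'Col1: 0', 'Col2: 0', 'Col3: 0']
print(by_label)  # ['Col0: 0', 'Col1: 0', 'Col4: 0', 'Col5: 0']
```

In this model the first two columns agree only because truncation removes columns from the middle; every position after the gap picks up a formatter that belongs to a different column.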

    INSTALLED VERSIONS
    ------------------
    commit : None
    python : 3.6.10.final.0
    python-bits : 64
    OS : Linux
    OS-release : 5.4.0-37-generic
    machine : x86_64
    processor : x86_64
    byteorder : little
    LC_ALL : None
    LANG : en_US.UTF-8
    LOCALE : en_US.UTF-8
    pandas : 1.0.5
    numpy : 1.18.5
    pytz : 2019.3
    dateutil : 2.8.1
    pip : 20.0.2
    setuptools : 46.0.0.post20200309
    Cython : 0.29.15
    pytest : 5.4.1
    hypothesis : 5.19.3
    sphinx : 2.4.0
    blosc : None
    feather : None
    xlsxwriter : 1.2.8
    lxml.etree : 4.5.0
    html5lib : 1.0.1
    pymysql : None
    psycopg2 : None
    jinja2 : 2.11.1
    IPython : 7.16.1
    pandas_datareader: None
    bs4 : 4.8.2
    bottleneck : 1.3.2
    fastparquet : None
    gcsfs : None
    lxml.etree : 4.5.0
    matplotlib : 3.2.2
    numexpr : 2.7.1
    odfpy : None
    openpyxl : 3.0.3
    pandas_gbq : None
    pyarrow : None
    pytables : None
    pytest : 5.4.1
    pyxlsb : None
    s3fs : None
    scipy : 1.4.1
    sqlalchemy : 1.3.15
    tables : 3.4.4
    tabulate : 0.8.3
    xarray : 0.15.0
    xlrd : 1.2.0
    xlwt : 1.3.0
    xlsxwriter : 1.2.8
    numba : 0.50.1

rhshadrach commented 4 years ago

Thanks for reporting this - could you provide a minimal reproducible example of the data/code that demonstrates the issue?

ipcoder commented 4 years ago

    import pandas as pd

    def form(name):
        return lambda x: f"{name}: {x}"

    df = pd.DataFrame({f"Col{x}": range(5) for x in range(6)})
    print(df.to_string(formatters=formatters, max_cols=6))
    print(df.to_string(formatters={c: form(c) for c in df}, max_cols=4))

produces:

         Col0    Col1    Col2    Col3    Col4    Col5
    0 Col0: 0 Col1: 0 Col2: 0 Col3: 0 Col4: 0 Col5: 0
    1 Col0: 1 Col1: 1 Col2: 1 Col3: 1 Col4: 1 Col5: 1
    2 Col0: 2 Col1: 2 Col2: 2 Col3: 2 Col4: 2 Col5: 2
    3 Col0: 3 Col1: 3 Col2: 3 Col3: 3 Col4: 3 Col5: 3
    4 Col0: 4 Col1: 4 Col2: 4 Col3: 4 Col4: 4 Col5: 4
         Col0    Col1   ...      Col4    Col5
    0 Col0: 0 Col1: 0   ...   Col2: 0 Col3: 0
    1 Col0: 1 Col1: 1   ...   Col2: 1 Col3: 1
    2 Col0: 2 Col1: 2   ...   Col2: 2 Col3: 2
    3 Col0: 3 Col1: 3   ...   Col2: 3 Col3: 3
    4 Col0: 4 Col1: 4   ...   Col2: 4 Col3: 4

As you can see, the second print applies the wrong formatters to the columns after the truncation point: it selects them by position from the full sequence of formatters instead of the truncated one.

I have patched my version as suggested above:

    def _format_col(self, i: int) -> List[str]:
        frame = self.tr_frame
        formatter = self._get_formatter(frame.columns[i])   # instead of _get_formatter(i)

rhshadrach commented 4 years ago

Thanks - I can reproduce on master once I replace formatters=formatters in your example with the dictionary from the line below. This is indeed a bug, and your fix works well when formatters is a dictionary, but I don't think it will work when it is a list or tuple. In that case, _get_formatter is really expecting an integer representing the position.
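To illustrate the list/tuple concern, here is a rough sketch of a lookup that handles both kinds of formatters by translating the truncated position into a label and, for sequences, back into a position in the full column list. It is not pandas code and not a proposed patch: `resolve_formatter` and its parameters are hypothetical, and it assumes column labels are unique.

```python
from typing import Callable, Dict, List, Optional, Sequence, Union

Formatter = Callable[[object], str]


def resolve_formatter(
    formatters: Union[Sequence[Formatter], Dict[str, Formatter]],
    full_columns: List[str],
    truncated_columns: List[str],
    i: int,
) -> Optional[Formatter]:
    """Hypothetical helper: find the formatter for position i of the truncated frame."""
    label = truncated_columns[i]
    if isinstance(formatters, (list, tuple)):
        # Sequences are positional, so map the label back to its position
        # in the full column list (assumes unique labels).
        return formatters[full_columns.index(label)]
    # Dicts are keyed by label, so the label can be used directly.
    return formatters.get(label)


full_columns = [f"Col{n}" for n in range(6)]
truncated_columns = ["Col0", "Col1", "Col4", "Col5"]  # assumed truncation result

as_list = [(lambda x, name=name: f"{name}: {x}") for name in full_columns]
as_dict = dict(zip(full_columns, as_list))

positions = range(len(truncated_columns))
from_list = [
    resolve_formatter(as_list, full_columns, truncated_columns, i)(0)
    for i in positions
]
from_dict = [
    resolve_formatter(as_dict, full_columns, truncated_columns, i)(0)
    for i in positions
]

print(from_list)  # ['Col0: 0', 'Col1: 0', 'Col4: 0', 'Col5: 0']
print(from_dict)  # ['Col0: 0', 'Col1: 0', 'Col4: 0', 'Col5: 0']
```

Passing a label straight to a positional list, as the dict-only patch would do, cannot work because a list cannot be indexed by a column name; the extra translation step is what keeps both cases consistent in this sketch.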

I think the root cause of the issue is that the columns attribute is not updated after the call to _chk_truncate in __init__.

Would you be interested in submitting a PR to fix?

ipcoder commented 2 years ago

I have never contributed to pandas development and don't know the procedure. I assume some tests must pass before I commit, and maybe other things as well.

In addition, I tried to follow your lead and check whether the columns attribute should indeed be updated, but it is used in several places and the impact is difficult to estimate, especially without any comments describing the general logic and design intentions.
It seems I would need to understand all the implicit assumptions behind the formatting flow before I could make changes.