Open ipcoder opened 4 years ago
Thanks for reporting this - could you provide a minimal reproducible example of the data/code that demonstrates the issue?
import pandas as pd
def form(name):
    return lambda x: f"{name}: {x}"
df = pd.DataFrame({f"Col{x}":range(5) for x in range(6)})
print(df.to_string(formatters=formatters, max_cols=6))
print(df.to_string(formatters={c: form(c) for c in df}, max_cols=4))
produces:
```
      Col0     Col1     Col2     Col3     Col4     Col5
0  Col0: 0  Col1: 0  Col2: 0  Col3: 0  Col4: 0  Col5: 0
1  Col0: 1  Col1: 1  Col2: 1  Col3: 1  Col4: 1  Col5: 1
2  Col0: 2  Col1: 2  Col2: 2  Col3: 2  Col4: 2  Col5: 2
3  Col0: 3  Col1: 3  Col2: 3  Col3: 3  Col4: 3  Col5: 3
4  Col0: 4  Col1: 4  Col2: 4  Col3: 4  Col4: 4  Col5: 4
      Col0     Col1  ...     Col4     Col5
0  Col0: 0  Col1: 0  ...  Col2: 0  Col3: 0
1  Col0: 1  Col1: 1  ...  Col2: 1  Col3: 1
2  Col0: 2  Col1: 2  ...  Col2: 2  Col3: 2
3  Col0: 3  Col1: 3  ...  Col2: 3  Col3: 3
4  Col0: 4  Col1: 4  ...  Col2: 4  Col3: 4
```
As you can see, the second print applies the wrong formatters to the columns after the truncation marker: it selects them from the full sequence of formatters instead of the truncated one.
I have patched my version as suggested above:
def _format_col(self, i: int) -> List[str]:
    frame = self.tr_frame
    formatter = self._get_formatter(frame.columns[i])  # instead of _get_formatter(i)
Thanks - I can reproduce on master once I replace `formatters=formatters` in your example with the dictionary from the line below. This is indeed a bug, and your fix works well for the case where formatters are a dictionary, but I don't think it will work in the case of a list or tuple. Here, `_get_formatter` is really expecting an integer representing the position.
I think the root cause of the issue is that the `columns` attribute is not updated after the call to `_chk_truncate` in `__init__`.
Would you be interested in submitting a PR to fix?
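The list/tuple concern above can be sketched in plain Python. This is only an illustration, not pandas code: `resolve_formatter` and its parameters are hypothetical names, and the sketch assumes that column labels are unique, so that a label from the truncated frame can be mapped back to its position in the full frame.

```python
def resolve_formatter(formatters, full_columns, truncated_columns, i):
    """Return the formatter for the i-th column of the truncated frame.

    Hypothetical helper, not part of pandas. Assumes unique column labels.
    """
    name = truncated_columns[i]
    if isinstance(formatters, (list, tuple)):
        # Positional formatters are aligned with the FULL frame, so map the
        # label back to its original position before indexing.
        return formatters[full_columns.index(name)]
    # Dict formatters are keyed by column label; missing keys yield None.
    return formatters.get(name)


full_columns = [f"Col{n}" for n in range(6)]
truncated_columns = full_columns[:2] + full_columns[-2:]  # Col0, Col1, Col4, Col5

as_list = [lambda x, n=n: f"Col{n}: {x}" for n in range(6)]
as_dict = {name: fmt for name, fmt in zip(full_columns, as_list)}

# Position 2 in the truncated frame is Col4 in both cases.
print(resolve_formatter(as_list, full_columns, truncated_columns, 2)(0))
print(resolve_formatter(as_dict, full_columns, truncated_columns, 2)(0))
```

Whether pandas should resolve formatters this way, or instead update `columns` after truncation, is a design question this sketch does not settle.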
I have never contributed to pandas development and don't know the procedure. I assume some tests must pass before I commit, and maybe there are other requirements.
In addition, I tried to follow your lead and check whether the `columns` attribute should indeed be updated, but it is used in several places, and the impact is difficult to estimate, especially without any comments describing the general logic and design intentions.
It seems I need to understand all the implicit assumptions behind the formatting flow to be able to make changes.
Problem description
I am providing custom formatters for specific columns as a dict. If the frame is large enough that some columns are truncated, the wrong formatters are applied to the columns. (In my case this leads to crashes, because a formatter receives data of the wrong type.)
Please note that the behavior changes with the width of the console window, since different columns are displayed.
Problem investigation
I have examined the code of my version of pandas (1.0.5) and compared it with the latest version on GitHub; the bug still seems to be there.
The problem starts in this method (`DataFrameFormatter._to_str_columns`), where the frame is set to the truncated one, `frame = self.tr_frame`, and then `self._format_col(i)` is called with the index of the column in the TRUNCATED frame. This "truncated" column index is then passed to `self._get_formatter`, which uses the full frame's columns to retrieve the formatter, using an index `i` that corresponds to the columns of the truncated frame.
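The index mismatch described above can be reproduced without pandas. The following is a minimal sketch using plain lists in place of the frame's columns. The names `lookup_by_full_position` and `lookup_by_truncated_label` are made up for illustration and do not exist in pandas.

```python
# Six columns, truncated for display to the first two and the last two.
full_columns = [f"Col{n}" for n in range(6)]
truncated_columns = full_columns[:2] + full_columns[-2:]

# One formatter per column label, as in the dict passed to to_string().
formatters = {name: (lambda x, name=name: f"{name}: {x}") for name in full_columns}


def lookup_by_full_position(i):
    # Mirrors the reported behavior: a position in the truncated frame is
    # used to index the columns of the FULL frame.
    return formatters[full_columns[i]]


def lookup_by_truncated_label(i):
    # Mirrors the proposed patch: resolve the label in the truncated frame
    # first, then look the formatter up by that label.
    return formatters[truncated_columns[i]]


i = 2  # third displayed column, which is Col4
print(truncated_columns[i])             # Col4
print(lookup_by_full_position(i)(0))    # Col2: 0  (wrong formatter)
print(lookup_by_truncated_label(i)(0))  # Col4: 0
```

The first two displayed columns are unaffected because their positions coincide in both sequences, which matches the output shown in the example.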