pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.95k stars 18.04k forks source link

BUG: Comment in ODS-file gets included in string cells #55200

Closed pyZerrenner closed 1 year ago

pyZerrenner commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
df = pd.read_excel('TableWithComment.ods', engine='odf', header=0)
print(df)

Issue Description

Here is the example *.ods file for the example code: TableWithComment.ods. The file was create using LibreOffice Calc 7.4.7.2 TableWithComment

The print(df) command produces the following output

   Quantity 1 2023-09-19T00:00:00Comment IQuantity 2
0          10                                     20
1          11                                     21
2          12                                     22
3          13                                     23
4          14                                     24
5          15      2023-09-19T00:00:00Comment IIIABC
6          16                                     26
7          17                                     27
8          18                                     28

If a cell in the *.ods file has a comment and the cell content is a string (B1 and B7), the comment text and timestamp are appended in front of the cell content. This also applies to the header line. For cells containing numbers, the comment is ignored (A6 and B10).

(Note, that I am referring to comments inserted using LibreOffice Calc itself. This is unrelated to the comment argument of read_excel.)

Expected Behavior

The ods-comments should be ignored and only the cell content read into the dataframe. The expected output from print(df) is

   Quantity 1 Quantity 2
0          10         20
1          11         21
2          12         22
3          13         23
4          14         24
5          15        ABC
6          16         26
7          17         27
8          18         28

(This is the output when all comments in the *.ods file are deleted)

Installed Versions

INSTALLED VERSIONS ------------------ commit : ba1cccd19da778f0c3a7d6a885685da16a072870 python : 3.10.12.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.17763 machine : AMD64 processor : Intel64 Family 6 Model 165 Stepping 3, GenuineIntel byteorder : little LC_ALL : None LANG : en LOCALE : de_DE.cp1252 pandas : 2.1.0 numpy : 1.26.0 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.2.1 Cython : None pytest : None hypothesis : None sphinx : 7.2.6 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.15.0 pandas_datareader : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat: None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : 2.4.0 pyqt5 : None (The installed odfpy version is 1.4.1, which somehow is not shown in the above output) Libre Office Version -------------------- Version: 7.4.7.2 (x64) / LibreOffice Community Build ID: 723314e595e8007d3cf785c16538505a1c878ca5 CPU threads: 12; OS: Windows 10.0 Build 17763; UI render: Skia/Raster; VCL: win Locale: de-DE (de_DE); UI: de-DE Calc: threaded
paulreece commented 1 year ago

I can confirm that this bug exists on the main branch 2.1.0.

rhshadrach commented 1 year ago

Thanks for the report! Further investigations and PRs to fix are welcome!

SuadHus commented 1 year ago

take

Leohemmingsson commented 1 year ago

take

dimastbk commented 1 year ago

Maybe just skip office:annotation?

diff --git a/pandas/io/excel/_odfreader.py b/pandas/io/excel/_odfreader.py
index 277f64f636..48677468c7 100644
--- a/pandas/io/excel/_odfreader.py
+++ b/pandas/io/excel/_odfreader.py
@@ -206,7 +206,11 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
         elif cell_type == "string":
            return self._get_cell_string_value(cell)
         elif cell_type == "currency":
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
@@ -228,8 +232,10 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
         """
         from odf.element import Element
         from odf.namespaces import TEXTNS
+        from odf.office import Annotation
         from odf.text import S

+        office_annotation = Annotation().qname
         text_s = S().qname

         value = []
@@ -239,6 +245,8 @@ class ODFReader(BaseExcelReader["OpenDocument"]):
                 if fragment.qname == text_s:
                     spaces = int(fragment.attributes.get((TEXTNS, "c"), 1))
                     value.append(" " * spaces)
+                elif fragment.qname == office_annotation:
+                    continue
                 else:
                     # recursive impl needed in case of nested fragments
                     # with multiple spaces

Or extract only text:p/text:s (full list of possible elements here).

Leohemmingsson commented 1 year ago

@dimastbk I tried these changes and it looks good, you should open a MR :)