timvink / mkdocs-table-reader-plugin

MkDocs plugin that enables a markdown tag like {{ read_csv('table.csv') }} to directly insert various table formats into a page
https://timvink.github.io/mkdocs-table-reader-plugin/
MIT License

Non-string column headers cause an exception. #63

Closed (davidorme closed this issue 1 month ago)

davidorme commented 2 months ago

If you set header=None in the arguments to read_excel then the resulting dataframe comes in with integer column names. That then breaks on this line:

https://github.com/timvink/mkdocs-table-reader-plugin/blob/9f5c1aae8c7445bec311f672ecde6b1c2da5d97c/mkdocs_table_reader_plugin/markdown.py#L27

with the exception:

  ...
  File "/.../mkdocs_table_reader_plugin/markdown.py", line 27, in <listcomp>
    df.columns = [replace_unescaped_pipes(c) for c in df.columns]
  File "/.../mkdocs_table_reader_plugin/markdown.py", line 18, in replace_unescaped_pipes
    return re.sub(r"(?<!\\)\|", "\\|", text)
  File "/.../python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

The reason we're doing this is that we want to display data that repeats a value in the first row, so we need to avoid pandas' mangling of duplicated column names. However, the same bug occurs in normal use if a worksheet has a non-string entry in the first row.

The code below fixes the problem but feels a little bit hacky because the thead element of the rendered table is still there, containing those integer codes. Also, if the values are dates, then the user unavoidably gets the default string representation of the datetime object.

df.columns = [replace_unescaped_pipes(c) if isinstance(c, str) else str(c) for c in df.columns]
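
For anyone who wants to try the one-line fix in isolation, here is a minimal sketch that needs only the standard library. `replace_unescaped_pipes` is reconstructed from the traceback above (with a raw replacement string, which should behave the same way), and `sanitise_headers` is a hypothetical helper name that is not part of the plugin; it simply wraps the list comprehension so it can run on a plain list of column labels.

```python
import re


def replace_unescaped_pipes(text: str) -> str:
    """Escape every '|' that is not already preceded by a backslash."""
    # Reconstructed from the traceback; the plugin's real function may differ.
    return re.sub(r"(?<!\\)\|", r"\\|", text)


def sanitise_headers(columns):
    """Hypothetical helper: apply the proposed fix to any iterable of labels."""
    # Strings get their pipes escaped; everything else (ints, dates, ...)
    # falls back to its default str() representation.
    return [
        replace_unescaped_pipes(c) if isinstance(c, str) else str(c)
        for c in columns
    ]


# Integer labels, as produced by header=None, no longer raise TypeError.
print(sanitise_headers(range(3)))

# Mixed labels, as in an Excel sheet with a numeric cell in the first row.
print(sanitise_headers(["A", "B", 3]))

# Unescaped pipes are escaped; already escaped pipes are left alone.
print(sanitise_headers(["a|b", r"c\|d"]))
```

As noted above, non-string labels are only converted with `str()`, so a datetime header would still come out in its default string form.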
timvink commented 2 months ago

Happy to fix. Can you provide a reproducible example? So a very small .csv file and the arguments you used?

Then I'll write a unit test to ensure it doesn't come back ever :)

jacobcook1995 commented 2 months ago

Ahh great! So using the following contents

A,B,C
1,2,3
4,5,6

calling {{ read_csv('test.csv', header = None) }} generates the above error, as does {{ read_excel('Example.xlsx', sheet_name = 'test', header = None) }}.

The current workaround we are using is calling {{ read_csv('test.csv', header = None, names = ["a", "b", "c"]) }}, which ensures that valid header names are provided.
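
The same behaviour can be checked outside MkDocs with pandas alone. This is a rough sketch that assumes pandas is installed and reads the CSV contents from an in-memory string rather than from test.csv; it shows only the column labels that the plugin would then try to sanitise, not the plugin itself.

```python
import io

import pandas as pd

# Same contents as the test.csv described above, held in memory.
CSV_TEXT = "A,B,C\n1,2,3\n4,5,6\n"

# header=None: pandas treats every row as data and labels the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(CSV_TEXT), header=None)
print([type(c).__name__ for c in no_header.columns])

# Workaround: pass explicit string names so every column label is a str.
named = pd.read_csv(io.StringIO(CSV_TEXT), header=None, names=["a", "b", "c"])
print(list(named.columns))
```

With `header=None` the labels are integers, which is what triggers the `TypeError` in `re.sub`; with `names` supplied they are strings, and the original `A,B,C` line is kept as the first data row.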

davidorme commented 2 months ago

Just to add to that: if the data below is in a CSV file then {{ read_csv('test.csv') }} works. The same data in an Excel file ({{ read_excel('tmp.xlsx', tablefmt='github') }}) fails with the error above when it tries to sanitise the headers. I assume that this is because pandas.read_csv automatically constrains header entries to be strings, but the dataframe object from an Excel file retains the cell format and so the 3 comes through as an int.

A,B,3
1,2,3
4,5,6
timvink commented 1 month ago

I implemented your suggestion to support non-string headers (in https://github.com/timvink/mkdocs-table-reader-plugin/commit/0b6201de52d5fcc96cce96386bfe936f90fdba44).

Indeed specifying header=None will leave you with an integer heading:

(image: rendered table with integer column headings)

If you need more control, I suggest writing a mkdocs hook to write the table to a markdown file, and insert it using the snippets extension. See https://timvink.github.io/mkdocs-table-reader-plugin/howto/alternatives/#write-tables-to-markdown-files

timvink commented 1 month ago

Released https://pypi.org/project/mkdocs-table-reader-plugin/

davidorme commented 1 month ago

Fantastic. Thanks, @timvink - we'd adopted a hook to solve a more complex problem, but this really simplifies most of our use cases.