jmcnamara / libxlsxwriter

A C library for creating Excel XLSX files.
https://libxlsxwriter.github.io

Using the library in a docker results in a segfault on workbook_get_worksheet_by_name #461

Closed BinarSkugga closed 1 month ago

BinarSkugga commented 1 month ago

Hello,

I am trying to use this library from Python. It works great on my computer, but when I run it inside a Docker container it segfaults. I'm using ctypes and libxlsxwriter 1.1.8. I ran it under gdb; here's the full log:

Starting program: /work/venv/bin/python -u service.py -ex
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7f3f1ea00700 (LWP 37)]
[New Thread 0x7f3f1e000700 (LWP 38)]
[New Thread 0x7f3f19600700 (LWP 39)]
[New Thread 0x7f3f16c00700 (LWP 40)]
[New Thread 0x7f3f16200700 (LWP 41)]
[New Thread 0x7f3f11800700 (LWP 42)]
[New Thread 0x7f3f0ee00700 (LWP 43)]
[New Thread 0x7f3f0c400700 (LWP 44)]
[New Thread 0x7f3f09a00700 (LWP 45)]
[New Thread 0x7f3f07000700 (LWP 46)]
[New Thread 0x7f3f04600700 (LWP 47)]
[New Thread 0x7f3f01c00700 (LWP 48)]
[New Thread 0x7f3eff200700 (LWP 49)]
[New Thread 0x7f3efc800700 (LWP 50)]
[New Thread 0x7f3ef9e00700 (LWP 51)]
[New Thread 0x7f3ef9400700 (LWP 52)]
[New Thread 0x7f3ef4a00700 (LWP 53)]
[New Thread 0x7f3ef2000700 (LWP 54)]
[New Thread 0x7f3eef600700 (LWP 55)]
[New Thread 0x7f3eecc00700 (LWP 56)]
[New Thread 0x7f3eea200700 (LWP 57)]
[New Thread 0x7f3ee7800700 (LWP 58)]
[New Thread 0x7f3ee6e00700 (LWP 59)]
Test 1

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f3f22acd10c in workbook_get_worksheet_by_name (
    self=self@entry=0x610b5d30, name=name@entry=0x55ba615b8e90 "test")
    at workbook.c:2669
2669        if (!name)

Here's my Dockerfile for building and running it:

# Second stage builds the final image. It packages the application and the venv created in the previous stage.
FROM python:3.12.7-slim-bullseye

WORKDIR /work

RUN apt-get update -y
RUN apt-get install -y git cmake zlib1g-dev

# Debug setup
RUN apt-get install -y gdb strace
ENV CFLAGS="-g -O0"

# Copy our service inside the final image.
COPY service.py .
COPY setup.cfg .
COPY src ./src

# RUN git clone https://github.com/jmcnamara/libxlsxwriter.git
COPY resources ./resources
RUN cd resources/libxlsxwriter-1.1.8 && make V=1
RUN cp ./resources/libxlsxwriter-1.1.8/lib/libxlsxwriter.so ./resources/libxlsxwriter.so

EXPOSE 8080

# Set our service's entrypoint as the command to be run when a container starts from this image.
CMD gdb --ex run --args python -u service.py -ex

And finally the Python code, which works on my host:

import os
from typing import Any, Dict

class Workbook:
    def __init__(self, book_name: str, options: WorkbookOptions, xlsxlib: Any):
        self.xlsx = xlsxlib

        self.name = book_name
        self.options = options
        self._sheets: Dict[int, Worksheet] = {}
        self._formats: Dict[int, Format] = {}

        tmp_base_path = os.path.join(os.getcwd(), "resources", "tmp")

        self._c_book = self.xlsx.workbook_new_opt(
            cstring(os.path.join(tmp_base_path, self.name)),
            cref(self.options.to_cstruct())
        )

    def add_sheet(self, sheet: Worksheet) -> Worksheet:
        c_sheet = self.xlsx.workbook_add_worksheet(self._c_book, cstring(sheet.name))
        self._sheets[c_sheet] = sheet
        sheet.owner_book = self
        sheet.c_id = c_sheet
        return sheet

cstring wraps the value in a ctypes.c_char_p. The call that fails is workbook_add_worksheet.
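One detail in the gdb trace may be worth checking, although nothing in this thread confirms it as the cause: `self=0x610b5d30` fits in 32 bits, while `name=0x55ba615b8e90` is a full 64-bit address. Unless `restype` is set, ctypes assumes a foreign function returns a C `int`, so a pointer returned by `workbook_new_opt` could be cut down to its low 32 bits before being passed back in. The sketch below declares pointer-sized types for the two calls used above. The helper names and the example address are hypothetical, and `lib` is assumed to be the handle returned by `ctypes.CDLL` for the `.so` built in the Dockerfile.

```python
import ctypes


def declare_prototypes(lib):
    """Declare pointer-sized return and argument types on a loaded library.

    `lib` is assumed to be ctypes.CDLL("./resources/libxlsxwriter.so");
    the loading code is not shown in the thread.
    """
    # lxw_workbook *workbook_new_opt(const char *filename,
    #                                lxw_workbook_options *options);
    lib.workbook_new_opt.restype = ctypes.c_void_p
    lib.workbook_new_opt.argtypes = [ctypes.c_char_p, ctypes.c_void_p]

    # lxw_worksheet *workbook_add_worksheet(lxw_workbook *workbook,
    #                                       const char *sheetname);
    lib.workbook_add_worksheet.restype = ctypes.c_void_p
    lib.workbook_add_worksheet.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
    return lib


def as_default_restype(address):
    """Show what a 64-bit address becomes when it is stored in a C int."""
    return ctypes.c_int(address).value


# Hypothetical address: only the low 32 bits (0x610b5d30) appear in the trace.
full_address = 0x55BA610B5D30
print(hex(full_address), "->", hex(as_default_restype(full_address)))
```

If this is what is happening, calling `declare_prototypes(lib)` once after loading the library, before `workbook_new_opt`, should keep the workbook pointer intact. Every other function that returns or receives a pointer would need the same treatment.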

jmcnamara commented 1 month ago

Why not just use the Python version of the library, XlsxWriter: https://xlsxwriter.readthedocs.io/index.html

BinarSkugga commented 1 month ago

My reports have upward of 100k lines and it can take 3-5 minutes to generate them using the Python library. I am trying to gain some performance so the delay is not as bad. I tried using constant memory, and the port for this in Python seems abandoned.

jmcnamara commented 1 month ago

My reports have upward of 100k lines and it can take 3-5 minutes to generate them using the Python library.

It shouldn't take that long. Here is a quick test I did with the performance example in the XlsxWriter repo:

python dev/performance/perf_pyx.py 100000
100000,  50,  34.07, 0

It writes 100,000 rows by 50 columns of mixed numbers and strings in around 30 seconds. Try it out on your test machine. If you get similar results but your overall program takes 3 minutes then the bottleneck is elsewhere.
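One way to check whether the bottleneck is in the spreadsheet writing or elsewhere is to time each stage of the report separately. The following is a minimal sketch using only the standard library. The stage names and the stand-in workloads are hypothetical and would be replaced by the real data loading and worksheet writing code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Add the wall-clock duration of the enclosed block to results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[label] = results.get(label, 0.0) + elapsed


def slowest_stage(results):
    """Return the label that accumulated the most time."""
    return max(results, key=results.get)


results = {}
with timed("fetch_rows", results):
    # Stand-in for loading the report data.
    rows = [[i, str(i)] for i in range(1000)]
with timed("write_xlsx", results):
    # Stand-in for the worksheet write calls.
    total = sum(len(row) for row in rows)

print(slowest_stage(results), results)
```

If the writing stage accounts for only a small share of the 3-5 minutes, a faster XLSX writer is unlikely to help much.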

I tried using constant memory, and the port for this in Python seems abandoned.

It isn't abandoned. I maintain both XlsxWriter and libxlsxwriter and the constant_memory functionality is exactly the same in both.

BinarSkugga commented 1 month ago

I'll try that inside our Docker container. I believe it's limited in RAM and CPU, which might be the issue; either that or some abstraction that we built around the library.
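The limits applied to a container can be read from inside it. This is a minimal sketch, assuming the container runs under cgroup v2, where the CPU quota and memory limit are exposed as `cpu.max` and `memory.max` under `/sys/fs/cgroup`. On cgroup v1 hosts the file names differ and this sketch would report `None`. The function names are hypothetical.

```python
from pathlib import Path


def parse_cpu_max(text):
    """Convert cgroup v2 cpu.max ("<quota> <period>" or "max <period>") to CPUs."""
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU quota configured
    return int(quota) / int(period)


def parse_memory_max(text):
    """Convert cgroup v2 memory.max ("<bytes>" or "max") to bytes."""
    text = text.strip()
    return None if text == "max" else int(text)


def read_limits(root="/sys/fs/cgroup"):
    """Return (cpus, memory_bytes); None where unlimited or not readable."""
    def read(name, parser):
        try:
            return parser((Path(root) / name).read_text())
        except (OSError, ValueError):
            return None

    return read("cpu.max", parse_cpu_max), read("memory.max", parse_memory_max)


print(read_limits())
```

Running the same benchmark on the host and in the container, alongside these values, would show whether the resource limits explain the difference in run time.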

I didn't mean that the Python library is abandoned, sorry about that. I meant that the Python wrapper around the C library is: https://github.com/pyexcel/libxlsxwpy

Apart from this, do you have any idea about the segfault? I still want an alternative if the issue is something I have no control over.

jmcnamara commented 1 month ago

Apart from this, do you have any idea about the segfault?

I don't. It is not something that I have encountered or have seen reported.

I still want an alternative if the issue is something I have no control over.

That is reasonable. If it is an option, then the Rust version of this library, rust_xlsxwriter, has the speed of the C library and (if you use Rust) the usability of the Python version. It also supports constant memory mode if needed: https://github.com/jmcnamara/rust_xlsxwriter

jmcnamara commented 1 month ago

I will need to close this because I don't believe it is a bug in libxlsxwriter. If you do find the source of the issue let me know.

BinarSkugga commented 1 month ago

No worries, thanks for your help still :)