sfu-db / connector-x

Fastest library to load data from DB to DataFrames in Rust and Python
https://sfu-db.github.io/connector-x
MIT License
1.87k stars 148 forks

Python segfault / corrupted double-linked list (not small) in a docker container #378

Open t0k4rt opened 1 year ago

t0k4rt commented 1 year ago

What language are you using?

Python

What version are you using?

0.3.0

What database are you using?

Postgresql

What dataframe are you using?

Pandas

Can you describe your bug?

I've got a Python script running in Docker that loads data from SQL into a pandas DataFrame; depending on the volume of data, it can load between 15 GB and 60 GB of data into memory.

This issue is not related to Docker memory limits: the script is monitored and fails well below the container's memory limit.

The issue I get is complicated. It mainly fails silently; I need to open dmesg to see that a segfault happened.

It seems to happen when the data has finished downloading from the database.

The issue is Docker-specific: when I run the script on my dev machine (without Docker), everything works fine.

It seems to me there are two cases:

First case: my Python script uses 15 GB of memory

When the data transfer finishes, the Python script fails silently and triggers a segfault:

[17623143.600654] python[696237]: segfault at 0 ip 00007f746595680a sp 00007ffc7119fe60 error 6 in connectorx.cpython-38-x86_64-linux-gnu.so[7f746555d000+201c000]
[17623143.606349] Code: 41 56 53 48 83 ec 18 49 89 fe 66 48 8d 3d 9e df 68 02 66 66 48 e8 e6 6e c0 ff 48 83 38 00 74 18 48 83 c0 08 48 83 38 00 74 2e <49> 83 06 ff 74 77 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 70 df 68 02

My process then restarts (I've got a restart policy for my failed containers), and in this case the process does not fail (this happened each time).

Second case: my script uses 30 GB of memory

When the data transfer finishes, the Python script fails with the error "corrupted double-linked list (not small)" and seems to trigger the same kind of segfault:

[17623143.600654] python[696237]: segfault at 0 ip 00007f746595680a sp 00007ffc7119fe60 error 6 in connectorx.cpython-38-x86_64-linux-gnu.so[7f746555d000+201c000]
[17623143.606349] Code: 41 56 53 48 83 ec 18 49 89 fe 66 48 8d 3d 9e df 68 02 66 66 48 e8 e6 6e c0 ff 48 83 38 00 74 18 48 83 c0 08 48 83 38 00 74 2e <49> 83 06 ff 74 77 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 70 df 68 02

My process then restarts (I've got a restart policy for my failed containers), but the process still fails.
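Since the crash only surfaces in dmesg, it may help to get at least a Python-level traceback at the moment of the segfault. A minimal diagnostic sketch using the standard-library faulthandler module (nothing connectorx-specific is assumed here):

import faulthandler
import sys

# Dump the Python traceback of every thread to stderr when the interpreter
# receives a fatal signal such as SIGSEGV or SIGABRT.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... run the connectorx query as usual after this point ...

The same effect can be had without editing the script by setting PYTHONFAULTHANDLER=1 in the container environment.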

What are the steps to reproduce the behavior?

I cannot reproduce this on my local machine, because there my local Docker instance hits the container memory limit and the container is shut down.

The host runs Debian 11 (128 GB RAM / 24 cores).

Our containers use the latest Python 3.8.15, built with pyenv using these specific build flags: RUN CONFIGURE_OPTS="--enable-shared" PYTHON_CFLAG="-march=haswell -O3 -pipe" pyenv install ${PYTHON_VERSION}
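To rule the custom interpreter build in or out, it may be worth confirming which flags the interpreter inside the container was actually compiled with. A minimal sketch using the standard-library sysconfig module (the variable names below are standard CPython build metadata, not specific to this issue):

import sysconfig

# Compiler flags and configure arguments baked into this CPython build,
# e.g. to check whether -march=haswell actually made it into the container's interpreter.
print(sysconfig.get_config_var("CFLAGS"))
print(sysconfig.get_config_var("CONFIG_ARGS"))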

Database setup if the error only happens on specific data or data type
Example query / code

This script and query should generate the same kind of data we are using, with a high enough volume, on a Docker container:

test_bug.py

import connectorx as cx
import time
import os

# Synthetic query generating data similar in shape to the real workload
query = """
WITH time as (SELECT generate_series('2022-01-01', '2022-08-01', '1 second'::interval) as timestamp_client),
ids AS (SELECT cid from (values('B45668C2-BFDC-4861-A38D-6141933F6940'),('40ABA32A-24EE-4876-8568-8E8E51D1D942'),('1837CE67-4BCC-4936-BC6D-76874BE1C4FF'),('D9194F09-7122-4EC7-AE81-FCB13A06B4EA'), ('645641C3-4E84-4475-AF85-1DFFBFE18726')) AS x(cid))
SELECT
    cid,
    timestamp_client,
    500*random() as accuracy,
    'ios' as os,
    'UTC' as timezone,
    40562 as place_id,
    random() as confidence,
    500*random() as distance
FROM ids
JOIN time ON True
ORDER BY timestamp_client ASC
"""
print(time.ctime())
print(os.environ.get('PG_CONN_URL'))
print(query)
# Load the full result set into a pandas DataFrame in one shot
result = cx.read_sql(os.environ.get('PG_CONN_URL'), query)
print(result.shape)
print(time.ctime())

Dockerfile

FROM python:3.8-bullseye

# install dependencies
RUN pip install connectorx pandas==1.3.5
RUN pip list
COPY ./test_bug.py /test_bug.py

ENTRYPOINT ["python", "/test_bug.py"]

docker build --pull -t connectorx-bug -f Dockerfile .
docker run --env PG_CONN_URL=your_db_conn_url connectorx-bug

What is the error?

Segfault

[17623143.600654] python[696237]: segfault at 0 ip 00007f746595680a sp 00007ffc7119fe60 error 6 in connectorx.cpython-38-x86_64-linux-gnu.so[7f746555d000+201c000]
[17623143.606349] Code: 41 56 53 48 83 ec 18 49 89 fe 66 48 8d 3d 9e df 68 02 66 66 48 e8 e6 6e c0 ff 48 83 38 00 74 18 48 83 c0 08 48 83 38 00 74 2e <49> 83 06 ff 74 77 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 70 df 68 02

And sometimes: "corrupted double-linked list (not small)"
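Until the root cause is known, one possible mitigation is to lower the peak memory of a single read by letting connectorx partition the query (read_sql accepts partition_on and partition_num). This is only a hedged sketch: the table name events and the integer column id below are hypothetical, and the query in this report has no numeric key, so partitioning it would require adding one:

import os
import connectorx as cx

# Hypothetical partitioned read: connectorx issues partition_num sub-queries,
# each covering a range of the numeric column named in partition_on.
result = cx.read_sql(
    os.environ.get('PG_CONN_URL'),
    "SELECT id, cid, timestamp_client, accuracy FROM events",  # hypothetical table
    partition_on="id",   # must be a numeric column in the result set
    partition_num=8,
)
print(result.shape)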

t0k4rt commented 1 year ago

When I have some time, I'll check whether it's related to the Python version being built from source.

Babbleshack commented 1 year ago

I'm not sure if you had time to test against different Python versions, but we are experiencing a similar issue on Python 3.9.14. Specifically, we are doing a join on large datasets (>60 GB).

t0k4rt commented 1 year ago

I'm not sure if you had time to test against different Python versions, but we are experiencing a similar issue on Python 3.9.14. Specifically, we are doing a join on large datasets (>60 GB).

I'm working on it; I'm building some Docker images to test my code with different Python versions. I'll keep you updated when I have some news!

kmatt commented 1 year ago

Similar, selecting 100,000 rows from MS SQL Server, on Ubuntu 22.04.1 (5.15.0-56-generic), Python 3.10.6.

Not running in Docker, but in a VMware VM in this case:

[1191578.055637] show_signal_msg: 22 callbacks suppressed
[1191578.055642] python3[408830]: segfault at 0 ip 00007f75e99e0bca sp 00007ffc0b5e5660 error 6 in connectorx.cpython-310-x86_64-linux-gnu.so[7f75e9694000+1ee8000]
[1191578.055663] Code: 41 56 53 48 83 ec 18 48 89 fb 66 48 8d 3d 46 5e 60 02 66 66 48 e8 26 3b cb ff 48 83 38 00 74 17 48 83 c0 08 48 83 38 00 74 2d <48> ff 0b 74 75 48 83 c4 18 5b 41 5e c3 66 48 8d 3d 19 5e 60 02 66