man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Segmentation fault error calling library.read() using LMDB backend #181

Closed Naich closed 1 year ago

Naich commented 1 year ago

I am trying to test the performance of the ArcticDB database using the LMDB backend. However, while testing lib.read(), I occasionally encounter a "Segmentation fault (core dumped)" or "free(): invalid pointer / double free or corruption (out) / Aborted (core dumped)" error without any other error messages, causing the program to exit abruptly.

The error seems to be random: I recorded the symbol and date_range that triggered it, but upon retrying I was able to retrieve the data successfully. Could you please provide guidance on possible causes and how to debug this? Thanks

I am using ArcticDB version 1.0.1, Python 3.8.16, and Ubuntu 20.04.1 LTS operating system.

Here is the code I used:

from arcticdb import Arctic
import pandas as pd
import random

ac_store = Arctic("lmdb:///data/1MinData/db/")
lib = ac_store.get_library('min_data')

for i in range(4000):
    # Repeatedly read a fixed symbol over a one-year date range
    random_ticker = "000001.SZ"
    start_date = pd.Timestamp('2017-10-08')
    end_date = pd.Timestamp('2018-10-08')
    data = lib.read(random_ticker, date_range=(start_date, end_date))

Here is the traceback info:

kan@Ares:~$ catchsegv python read_test.py 
free(): invalid pointer
Aborted (core dumped)
*** Segmentation fault
Register dump:

 RAX: 00007ea36bfff010   RBX: 0000000100000003   RCX: 00007f70cdd308e6
 RDX: 0000000100000003   RSI: 00007f70cd503450   RDI: 00007ea36bfff010
 RBP: 00007f6fb14d10d0   R8 : 00007ea36bfff010   R9 : 0000000000000000
 R10: 0000000000000022   R11: 0000000000000246   R12: 00007ea59400f670
 R13: 00007ea59400f6b8   R14: 00000000000000d0   R15: 00007ea59400f4f0
 RSP: 00007f6fb14d10a8

 RIP: 00007f70cdda3854   EFLAGS: 00010287

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000004   OldMask: 00000000   CR2: cd503433

 FPUCW: 0000037f   FPUSW: 00000020   TAG: 00007f70
 RIP: cca800b0   RDP: 00000000

 ST(0) ffff 8000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) ffff 81ceb32c4b43fcf5
 ST(4) ffff f800000000000000   ST(5) ffff 8000000000000000
 ST(6) ffff 8000000000000000   ST(7) c000 c000000000000000
 mxcsr: 1fa0
 XMM0:  00000000000000000000000000000000 XMM1:  00000000000000000000000000000000
 XMM2:  00000000000000000000000000000000 XMM3:  00000000000000000000000000000000
 XMM4:  00000000000000000000000000000000 XMM5:  00000000000000000000000000000000
 XMM6:  00000000000000000000000000000000 XMM7:  00000000000000000000000000000000
 XMM8:  00000000000000000000000000000000 XMM9:  00000000000000000000000000000000
 XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
 XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
 XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000

Backtrace:
kan@Ares:~$

Attached is my environment info: environment.txt

mehertz commented 1 year ago

Hi @Naich

Well that's weird!

Do you know whether it is segfaulting on import, or is it definitely on read? If it is on import, would you be able to try the wheel located here:

https://github.com/man-group/ArcticDB/suites/11870429345/artifacts/620829633

You'll have to be signed into GitHub to access that file. It is the same code, but built slightly differently, which might help with your environment.

Naich commented 1 year ago

Hi mehertz,

I ran some additional tests, and it appears that the issue I've been experiencing may be related to parallel writes to the same database.

Here is the code I used for my tests:

Method 1: Sequential Write

from arcticdb import Arctic
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

ac_store = Arctic("lmdb://./testdb")
ac_store.create_library("test_db")
lib = ac_store.get_library('test_db')
# One million rows of random data indexed over ~720 days
date_idx = pd.Timestamp('2021-1-1') + pd.timedelta_range(start='1 days', end='720 days', periods=1_000_000)
df = pd.DataFrame(np.random.random(size=(len(date_idx))), index=date_idx, columns=['a'])
for _i in tqdm(range(1000)):
    lib.write(f"TEST_{_i}", df)

This method worked fine for read/write operations.

Method 2: Parallel Write on Different Symbols

from arcticdb import Arctic
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

from joblib import Parallel, delayed

# Each task writes its own TEST_{db_id} symbol, so there are no
# concurrent writes to any single symbol.
def write_ticker_db(db_id):
    _ac_store = Arctic("lmdb://./testdb")
    _lib = _ac_store['test_db']
    date_idx = pd.Timestamp('2021-1-1') + pd.timedelta_range(start='1 days', end='720 days', periods=1_000_000)
    df = pd.DataFrame(np.random.random(size=(len(date_idx))), index=date_idx, columns=['a'])
    _lib.write(f'TEST_{db_id}', df)
    return 0

rst = Parallel(n_jobs=40, backend='multiprocessing')(delayed(write_ticker_db)(db_id) for db_id in tqdm(range(1000)))

When I wrote data using Method 2 and then performed heavy reads, the program would occasionally crash with a segmentation fault.

So does ArcticDB support concurrent writes to different symbols? The documentation says I need to set "staged=True" if I want to do concurrent writes to the same symbol, so it seems that concurrent writes to different symbols should be supported. However, maybe I am not understanding it correctly, and I would appreciate your suggestions. Thanks!
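
For reference, this is roughly the staged-write pattern for a single symbol that I understand the documentation to describe (just a sketch on my part; the finalize_staged_data call and its exact behaviour are my reading of the docs, not something I have verified):

from arcticdb import Arctic
import pandas as pd
import numpy as np

lib = Arctic("lmdb://./testdb")["test_db"]

# Each concurrent writer stages its own time-sorted chunk of the symbol
# instead of writing it directly:
idx = pd.date_range("2021-01-01", periods=10, freq="D")
chunk = pd.DataFrame({"a": np.random.random(10)}, index=idx)
lib.write("TEST_STAGED", chunk, staged=True)

# Once every writer has finished, a single finalize call merges the
# staged segments into one readable version of the symbol:
lib.finalize_staged_data("TEST_STAGED")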

PS: I have already tried the wheel that you suggested on a clean environment, but I still get the same results.

Naich commented 1 year ago

UPDATE: When I tried using larger, real stock data, even sequential writes did not avoid the problem, so it might be some other issue. However, this time I received more information in the traceback (with the wheel you suggested). I have attached the relevant tracebacks below. Could you please take a look and see if you can identify the issue?

Head of traceback#1:

*** Segmentation fault
Register dump:

 RAX: 00007f88d7207a30   RBX: 0000000002456a20   RCX: 00007ebe8b7fd0f8
 RDX: 0000000000000003   RSI: 00007ebe8b7fd220   RDI: 00007ebe8b7fd240
 RBP: 00007ebe8b7fd2d0   R8 : 00007ebe8b7fcdf0   R9 : 00007ebe8b7fcd00
 R10: 00007f88d7345c50   R11: 000000000000001f   R12: 0000000002456a20
 R13: 0000000000000000   R14: 00007ebe8b7fd240   R15: 00007ebe8b7fd268
 RSP: 00007ebe8b7fd0f8

 RIP: 00007f88d6b5aa90   EFLAGS: 00010202

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000015   OldMask: 00000000   CR2: d6b5aa90

 FPUCW: 0000037f   FPUSW: 00000020   TAG: 00007f88
 RIP: d6b21edb   RDP: 00000000

 ST(0) ffff 8000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) ffff 81ceb32c4b43fcf5
 ST(4) ffff f800000000000000   ST(5) ffff b000000000000000
 ST(6) ffff d000000000000000   ST(7) d000 d000000000000000
 mxcsr: 1fa2
 XMM0:  00000000000000000000000000000000 XMM1:  00000000000000000000000000000000
 XMM2:  00000000000000000000000000000000 XMM3:  00000000000000000000000000000000
 XMM4:  00000000000000000000000000000000 XMM5:  00000000000000000000000000000000
 XMM6:  00000000000000000000000000000000 XMM7:  00000000000000000000000000000000
 XMM8:  00000000000000000000000000000000 XMM9:  00000000000000000000000000000000
 XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
 XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
 XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000

Backtrace:
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcti
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f88d7c4c420]

Memory map:

00400000-00423000 r--p 00000000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
00423000-005ef000 r-xp 00023000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
005ef000-006e5000 r--p 001ef000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e6000-006e7000 r--p 002e5000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e7000-0071f000 rw-p 002e6000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
0071f000-0073f000 rw-p 00000000 00:00 0
013a9000-046a6000 rw-p 00000000 00:00 0 [heap]

Head of traceback#2:

*** Segmentation fault
Register dump:

 RAX: 00007f18340a1ea8   RBX: 00007e4eac0008d0   RCX: 0000000000000003
 RDX: 00007e4ea0016c20   RSI: 00007e4ea000cda0   RDI: 00007e4eac016b60
 RBP: 00007e4f8ced22d0   R8 : 0000000000000002   R9 : 0000000000020c69
 R10: 00007e4ea000cfb0   R11: 00007e4ea000cf20   R12: 0000000000000000
 R13: 00007f18340a0ef0   R14: 00007e4e9c00acc0   R15: 00007e4f8ced24c0
 RSP: 00007e4f8ced22a0

 RIP: 00007f183215ed65   EFLAGS: 00010202

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000004   OldMask: 00000000   CR2: 00000068

 FPUCW: 0000037f   FPUSW: 00000020   TAG: 00007f18
 RIP: 339f0edb   RDP: 00000000

 ST(0) ffff 8000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) ffff 81ceb32c4b43fcf5
 ST(4) ffff f800000000000000   ST(5) ffff b000000000000000
 ST(6) ffff d000000000000000   ST(7) d000 d000000000000000
 mxcsr: 1fa2
 XMM0:  00000000000000000000000000000000 XMM1:  00000000000000000000000000000000
 XMM2:  00000000000000000000000000000000 XMM3:  00000000000000000000000000000000
 XMM4:  00000000000000000000000000000000 XMM5:  00000000000000000000000000000000
 XMM6:  00000000000000000000000000000000 XMM7:  00000000000000000000000000000000
 XMM8:  00000000000000000000000000000000 XMM9:  00000000000000000000000000000000
 XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
 XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
 XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000

Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x52)[0x7f1834855722]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x3a)[0x7f183211839
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x26a573)[0x7f18321a9573]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x22e376)[0x7f183216d376]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x23569d)[0x7f183217469d]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x235967)[0x7f1832174967]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x24cc86)[0x7f183218bc86]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14f05cd)[0x7f183342f5cd]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14e34d2)[0x7f18334224d2]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1500ff1)[0x7f183343fff1]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1507b92)[0x7f1833446b92]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x155f166)[0x7f183349e166]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x155fa4f)[0x7f183349ea4f]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1501a85)[0x7f1833440a85]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x15021df)[0x7f18334411df]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x150223b)[0x7f183344123b]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1504988)[0x7f1833443988]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14e4419)[0x7f1833423419]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14f1e45)[0x7f1833430e45]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arct
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f18348da133]

Memory map:

00400000-00423000 r--p 00000000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
00423000-005ef000 r-xp 00023000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
005ef000-006e5000 r--p 001ef000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e6000-006e7000 r--p 002e5000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e7000-0071f000 rw-p 002e6000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
0071f000-0073f000 rw-p 00000000 00:00 0
0120f000-041ac000 rw-p 00000000 00:00 0 [heap]

output2.txt output1.txt

buhbuhtig commented 1 year ago

I have the same problem with sequential reading (LMDB backend). This helped:

from arcticdb_ext import set_config_int
set_config_int('VersionStore.NumCPUThreads', 1)

or

from arcticdb_ext import set_config_int
set_config_int('VersionStore.NumIOThreads', 1)

Naich commented 1 year ago

from arcticdb_ext import set_config_int
set_config_int('VersionStore.NumIOThreads', 1)

Thanks! It worked, so it seems to be a multithreaded-read related bug.

However, the read throughput (in my case) dropped about 40% when I applied this thread limit. Currently I am using another dirty workaround: lib.read() seems to return correct data and only occasionally fails, so I run the read in a subprocess, and if it fails I replace it with a fresh process. In my case the performance drop is only about 10%.

from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=1)

def run_with_timeout(fn, timeout=0.1):
    # Run fn in the worker process; if it fails (including the worker
    # dying on a segfault), recreate the pool and retry.
    global executor
    future = executor.submit(fn)
    try:
        result = future.result(timeout=timeout)
    except Exception as e:
        print('Error: {}; retrying'.format(e))
        future.cancel()
        executor.shutdown()
        executor = ProcessPoolExecutor(max_workers=1)
        result = run_with_timeout(fn, timeout)

    return result

# random_lib_read (not shown here) is a zero-argument function that
# performs one lib.read() call, as in the first snippet of this issue.
for i in range(4000):
    result = run_with_timeout(random_lib_read, timeout=0.1)

huroh commented 1 year ago

Regarding Naich's run_with_timeout workaround above:

Could you give an example of the function you are passing to run_with_timeout, please, and how you call it? I am trying to follow your dirty workaround with a simple return arcdb['eom_fut_cont'].read(item_name).data in my function, but it just times out. If I don't include the timeout, it just hangs.

from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=1)

# arcdb is my Arctic instance, created elsewhere in the session
def read_lib(item_name):
    return arcdb['eom_fut_cont'].read(item_name).data

def run_until_success(fn, item_name, timeout=0.1):
    global executor
    future = executor.submit(fn, item_name)
    try:
        result = future.result(timeout=timeout)
    except Exception as e:
        print('Error: {}; retrying'.format(e))
        future.cancel()
        executor.shutdown()
        # Recreate the pool and retry (the original snippet had a typo
        # here: run_until_success2)
        executor = ProcessPoolExecutor(max_workers=1)
        result = run_until_success(fn, item_name, timeout)

    return result

test = run_until_success(read_lib, item_name)

I hope there can be a proper fix for this issue; it's happening very often when I try to read data, pretty much making ArcticDB unusable for me.

mehertz commented 1 year ago

Thanks @huroh @Naich.

We've made a lot of changes to LMDB recently designed to fix long-standing issues with our implementation (#597 #585). We're going to confirm whether these changes have resolved your issues.

poodlewars commented 1 year ago

Hi @Naich @huroh, please could you retest with arcticdb==1.6.0, which we released yesterday and which includes several fixes for LMDB.

You might also be interested in the "Threads and Processes" section of http://www.lmdb.tech/doc/starting.html. What you are doing, if I understand it correctly, should be safe, but it's worth noting that you should never open the same LMDB environment more than once from a given process.

In our case, creating the Arctic instance _ac_store = Arctic("lmdb://./testdb") is what creates the LMDB environment.
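
As a minimal sketch of that rule (the pool size, worker functions, and symbol names below are illustrative, not taken from this thread), each worker process creates its Arctic instance, and therefore its LMDB environment, exactly once and then reuses it:

from concurrent.futures import ProcessPoolExecutor
from arcticdb import Arctic

_lib = None  # one Arctic/LMDB environment per worker process

def _init_worker():
    global _lib
    # Create the Arctic instance (and thus open the LMDB environment)
    # exactly once in each worker process.
    _lib = Arctic("lmdb://./testdb")["test_db"]

def _read_symbol(symbol):
    # Reuse the per-process environment for every read in this worker.
    return _lib.read(symbol).data

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4, initializer=_init_worker) as ex:
        results = list(ex.map(_read_symbol, [f"TEST_{i}" for i in range(100)]))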

huroh commented 1 year ago

Working fine for me now, thank you so much @poodlewars @mehertz and team for all your work on this great project