Closed Naich closed 1 year ago
Hi @naich
Well that's weird!
Is it segfaulting on import do you know, or is it definitely on read? If it is on import, would you be able to try the wheel located here:
https://github.com/man-group/ArcticDB/suites/11870429345/artifacts/620829633
You'll have to be signed into GitHub to access that file. It is the same code built slightly differently, which might help with your environment.
Hi mehertz,
I ran some additional tests, and it appears the issue I've been experiencing may be related to parallel writes to the same database.
Here is the code I used for my tests:
Method 1: Sequential Write
from arcticdb import Arctic
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
ac_store = Arctic("lmdb://./testdb")
ac_store.create_library("test_db")
lib = ac_store.get_library('test_db')
date_idx = pd.Timestamp('2021-1-1') + pd.timedelta_range(start='1 days', end='720 days', periods=1_000_000)
df = pd.DataFrame(np.random.random(size=(len(date_idx))), index=date_idx, columns=['a'])
for _i in tqdm(range(1000)):
    lib.write(f"TEST_{_i}", df)
This method worked fine for read/write operations.
Method 2: Parallel Write on Different Symbols
from arcticdb import Arctic
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from joblib import Parallel,delayed
# this function ensures no concurrent writes to a single symbol
def write_ticker_db(db_id):
    _ac_store = Arctic("lmdb://./testdb")
    _lib = _ac_store['test_db']
    date_idx = pd.Timestamp('2021-1-1') + pd.timedelta_range(start='1 days', end='720 days', periods=1_000_000)
    df = pd.DataFrame(np.random.random(size=(len(date_idx))), index=date_idx, columns=['a'])
    _lib.write(f'TEST_{db_id}', df)
    return 0

rst = Parallel(n_jobs=40, backend='multiprocessing')(delayed(write_ticker_db)(db_id) for db_id in tqdm(range(1000)))
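[Editor's note] The same fan-out can be expressed with only the standard library, which may help rule joblib out when reproducing. This is a hedged sketch: the worker body is a placeholder, and in the actual reproduction each worker would open its own Arctic("lmdb://./testdb") and write its symbol, as in the joblib version above.

```python
from concurrent.futures import ProcessPoolExecutor

def write_ticker_db(db_id):
    # placeholder body; the real reproduction opens Arctic("lmdb://./testdb")
    # here and writes f"TEST_{db_id}", exactly as in the joblib version
    return 0

def run_all(n=100, workers=8):
    # fan the writes out over worker processes, mirroring
    # Parallel(n_jobs=..., backend='multiprocessing')
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(write_ticker_db, range(n)))

if __name__ == "__main__":
    rst = run_all()
```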
When I wrote data using Method 2 and then performed heavy reads, the program would occasionally throw a segmentation fault.
So does ArcticDB support concurrent writes to different symbols? The documentation says I need to set "staged=True" if I want concurrent writes to the same symbol, which suggests that concurrent writes to different symbols should be supported. However, maybe I am not understanding it correctly, and I would appreciate your suggestions. Thanks!
PS: I have already tried the wheel that you suggested on a clean environment, but I still get the same results.
UPDATE: When I tried using larger, actual stock data, even sequential write did not solve the problem, so it might be some other issue. However, this time I received more information in the traceback (with the wheel you suggested). I have attached the relevant tracebacks below. Could you please take a look and see if you can identify the issue?
Head of traceback#1:
*** Segmentation fault
Register dump:
RAX: 00007f88d7207a30 RBX: 0000000002456a20 RCX: 00007ebe8b7fd0f8
RDX: 0000000000000003 RSI: 00007ebe8b7fd220 RDI: 00007ebe8b7fd240
RBP: 00007ebe8b7fd2d0 R8 : 00007ebe8b7fcdf0 R9 : 00007ebe8b7fcd00
R10: 00007f88d7345c50 R11: 000000000000001f R12: 0000000002456a20
R13: 0000000000000000 R14: 00007ebe8b7fd240 R15: 00007ebe8b7fd268
RSP: 00007ebe8b7fd0f8
RIP: 00007f88d6b5aa90 EFLAGS: 00010202
CS: 0033 FS: 0000 GS: 0000
Trap: 0000000e Error: 00000015 OldMask: 00000000 CR2: d6b5aa90
FPUCW: 0000037f FPUSW: 00000020 TAG: 00007f88
RIP: d6b21edb RDP: 00000000
ST(0) ffff 8000000000000000 ST(1) 0000 0000000000000000
ST(2) 0000 0000000000000000 ST(3) ffff 81ceb32c4b43fcf5
ST(4) ffff f800000000000000 ST(5) ffff b000000000000000
ST(6) ffff d000000000000000 ST(7) d000 d000000000000000
mxcsr: 1fa2
XMM0: 00000000000000000000000000000000 XMM1: 00000000000000000000000000000000
XMM2: 00000000000000000000000000000000 XMM3: 00000000000000000000000000000000
XMM4: 00000000000000000000000000000000 XMM5: 00000000000000000000000000000000
XMM6: 00000000000000000000000000000000 XMM7: 00000000000000000000000000000000
XMM8: 00000000000000000000000000000000 XMM9: 00000000000000000000000000000000
XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000
Backtrace:
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcti/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f88d7c4c420]
Memory map:
00400000-00423000 r--p 00000000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
00423000-005ef000 r-xp 00023000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
005ef000-006e5000 r--p 001ef000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e6000-006e7000 r--p 002e5000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e7000-0071f000 rw-p 002e6000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
0071f000-0073f000 rw-p 00000000 00:00 0
013a9000-046a6000 rw-p 00000000 00:00 0 [heap]
Head of traceback#2:
*** Segmentation fault
Register dump:
RAX: 00007f18340a1ea8 RBX: 00007e4eac0008d0 RCX: 0000000000000003
RDX: 00007e4ea0016c20 RSI: 00007e4ea000cda0 RDI: 00007e4eac016b60
RBP: 00007e4f8ced22d0 R8 : 0000000000000002 R9 : 0000000000020c69
R10: 00007e4ea000cfb0 R11: 00007e4ea000cf20 R12: 0000000000000000
R13: 00007f18340a0ef0 R14: 00007e4e9c00acc0 R15: 00007e4f8ced24c0
RSP: 00007e4f8ced22a0
RIP: 00007f183215ed65 EFLAGS: 00010202
CS: 0033 FS: 0000 GS: 0000
Trap: 0000000e Error: 00000004 OldMask: 00000000 CR2: 00000068
FPUCW: 0000037f FPUSW: 00000020 TAG: 00007f18
RIP: 339f0edb RDP: 00000000
ST(0) ffff 8000000000000000 ST(1) 0000 0000000000000000
ST(2) 0000 0000000000000000 ST(3) ffff 81ceb32c4b43fcf5
ST(4) ffff f800000000000000 ST(5) ffff b000000000000000
ST(6) ffff d000000000000000 ST(7) d000 d000000000000000
mxcsr: 1fa2
XMM0: 00000000000000000000000000000000 XMM1: 00000000000000000000000000000000
XMM2: 00000000000000000000000000000000 XMM3: 00000000000000000000000000000000
XMM4: 00000000000000000000000000000000 XMM5: 00000000000000000000000000000000
XMM6: 00000000000000000000000000000000 XMM7: 00000000000000000000000000000000
XMM8: 00000000000000000000000000000000 XMM9: 00000000000000000000000000000000
XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000
Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x52)[0x7f1834855722]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x3a)[0x7f183211839
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x26a573)[0x7f18321a9573]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x22e376)[0x7f183216d376]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x23569d)[0x7f183217469d]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x235967)[0x7f1832174967]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x24cc86)[0x7f183218bc86]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14f05cd)[0x7f183342f5cd]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14e34d2)[0x7f18334224d2]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1500ff1)[0x7f183343fff1]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1507b92)[0x7f1833446b92]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x155f166)[0x7f183349e166]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x155fa4f)[0x7f183349ea4f]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1501a85)[0x7f1833440a85]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x15021df)[0x7f18334411df]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x150223b)[0x7f183344123b]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x1504988)[0x7f1833443988]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14e4419)[0x7f1833423419]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arcticdb_ext.cpython-38-x86_64-linux-gnu.so(+0x14f1e45)[0x7f1833430e45]
/home/kan/.conda/envs/arctic/lib/python3.8/site-packages/arct
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f18348da133]
Memory map:
00400000-00423000 r--p 00000000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
00423000-005ef000 r-xp 00023000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
005ef000-006e5000 r--p 001ef000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e6000-006e7000 r--p 002e5000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
006e7000-0071f000 rw-p 002e6000 08:05 55574715 /home/kan/.conda/envs/arctic/bin/python3.8
0071f000-0073f000 rw-p 00000000 00:00 0
0120f000-041ac000 rw-p 00000000 00:00 0 [heap]
I have the same problem with sequential reading (LMDB backend). Either of these workarounds helped:
from arcticdb_ext import set_config_int
set_config_int('VersionStore.NumCPUThreads', 1)
or
from arcticdb_ext import set_config_int
set_config_int('VersionStore.NumIOThreads', 1)
from arcticdb_ext import set_config_int
set_config_int('VersionStore.NumIOThreads', 1)
Thanks! That worked, so it seems to be a bug related to multithreaded reads.
However, with this thread limit applied, read throughput (in my case) dropped by about 40%. I am currently using another dirty workaround: lib.read() seems to return correct data and only fails occasionally, so I read in a subprocess and, if the read fails, replace it with a fresh process. In my case the performance drop is only about 10%.
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(max_workers=1)

def run_with_timeout(fn, timeout=0.1):
    global executor
    future = executor.submit(fn)
    try:
        result = future.result(timeout=timeout)
    except Exception as e:
        print('Error: {}; retrying'.format(e))
        future.cancel()
        executor.shutdown()
        executor = ProcessPoolExecutor(max_workers=1)
        result = run_with_timeout(fn, timeout)
    return result

# random_lib_read is the function performing a single lib.read() call (defined elsewhere)
for i in range(4000):
    result = run_with_timeout(random_lib_read, timeout=0.1)
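[Editor's note] The reason the executor must be rebuilt after a failure: when a worker process dies abruptly (for example from a segfault), concurrent.futures marks the pool as broken and raises BrokenProcessPool, and the same executor cannot be reused. A minimal stdlib-only sketch of that recovery pattern, independent of ArcticDB, with the crash simulated via SIGSEGV:

```python
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash():
    # simulate a segfaulting read by killing the worker process
    os.kill(os.getpid(), signal.SIGSEGV)

def demo():
    executor = ProcessPoolExecutor(max_workers=1)
    try:
        executor.submit(crash).result(timeout=30)
    except BrokenProcessPool:
        # the pool is unusable once a worker has died; build a fresh one
        executor = ProcessPoolExecutor(max_workers=1)
    # the replacement pool works normally
    value = executor.submit(len, "ok").result(timeout=30)
    executor.shutdown()
    return value

if __name__ == "__main__":
    print(demo())  # prints 2
```

This is also why the workaround above catches the exception, shuts the executor down, and constructs a new ProcessPoolExecutor before retrying.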
Could you give an example of the function you are passing to run_with_timeout, please, and how you call it? I am trying to follow your dirty workaround with a simple return arcdb['eom_fut_cont'].read(item_name).data in my function, but it just times out. If I don't include the timeout, it just hangs.
def read_lib(item_name):
    return arcdb['eom_fut_cont'].read(item_name).data

def run_until_success(fn, item_name, timeout=0.1):
    global executor
    future = executor.submit(fn, item_name)
    try:
        result = future.result(timeout=timeout)
    except Exception as e:
        print('Error: {}; retrying'.format(e))
        future.cancel()
        executor.shutdown()
        executor = ProcessPoolExecutor(max_workers=1)
        result = run_until_success(fn, item_name, timeout)
    return result

test = run_until_success(read_lib, item_name)
I hope there can be a proper fix for this issue; it happens very often when I try to read data, which pretty much makes ArcticDB unusable for me.
Thanks @huroh @Naich.
We've made a lot of changes to LMDB recently designed to fix long-standing issues with our implementation (#597 #585). We're going to confirm whether these changes have resolved your issues.
Hi @Naich @huroh, please could you retest with arcticdb==1.6.0, which we released yesterday and which includes several fixes for LMDB.
You might also be interested in the "Threads and Processes" section of http://www.lmdb.tech/doc/starting.html. What you are doing, if I understand it correctly, should be safe, but it is worth noting that you should never open the same LMDB environment more than once from a given process.
In our case, creating the Arctic instance with _ac_store = Arctic("lmdb://./testdb") is what creates the LMDB environment.
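[Editor's note] One way to honor that rule is to memoize the store per process, so repeated lookups within one worker reuse a single connection instead of reopening the LMDB environment. This is a hedged, ArcticDB-free sketch: _open_store is a hypothetical stand-in for Arctic(uri), shown only to illustrate the one-environment-per-process idea.

```python
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def _open_store(uri, pid):
    # in real code this would be Arctic(uri); a dict stands in here so the
    # sketch runs without arcticdb installed
    return {"uri": uri, "pid": pid}

def get_store(uri="lmdb://./testdb"):
    # keying the cache on os.getpid() gives each process exactly one store
    # per URI; after a fork, the child sees a new pid and opens its own
    return _open_store(uri, os.getpid())

# within one process, repeated calls return the same object
assert get_store() is get_store()
```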
Working fine for me now. Thank you so much @poodlewars @mehertz and team for all your work on this great project.
I am trying to test the performance of the ArcticDB database using the LMDB protocol. However, while testing lib.read(), I occasionally encounter a "Segmentation fault (core dumped)" or a "free(): invalid pointer" / "double free or corruption (out): Aborted (core dumped)" error with no other error messages, causing the program to exit abruptly.
The error seems to be random: I recorded the symbol and date_range that triggered it, but upon retrying I was able to retrieve the data successfully. Could you please provide guidance on possible reasons for the error and how to debug it? Thanks.
I am using ArcticDB version 1.0.1, Python 3.8.16, and Ubuntu 20.04.1 LTS operating system.
Here is the code I used:
Here is the traceback info:
Attached is my environment info. environment.txt