heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0

Intermittent SIGSEGV errors crashing HeavyDB #819

Open anirudh-here-com opened 9 months ago

anirudh-here-com commented 9 months ago

Version: 6.4.0. While running some queries against HeavyDB, SIGSEGV errors occur randomly, causing the DB to crash and create outages. Is there any way to debug or fix this? `HeavyDB.cpp:332 Interrupt signal (11) received.`

cdessanti commented 9 months ago

Could you please share the product logs? They can be found in the storage directory, typically located at /var/log/heavyai/storage/log, and are named heavydb.INFO.*

It is essential to check the logs to identify the problem. Is there a specific reason why you're using version 6.4 when versions 7.0 and 7.1 are available?

anirudh-here-com commented 9 months ago

I have done a detailed analysis and found the cause: this happens when I run a select_ipc_gpu against the database and the query returns 0 records.

anirudh-here-com commented 9 months ago

This can be easily replicated using the heavyai lib's select_ipc_gpu function:

    import heavyai
    conn = heavyai.connect(user=<user>, password=<pass>, dbname=<dbname>)
    conn.select_ipc_gpu(<any select query which returns 0 rows>)
    # raises: thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
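Until the server-side crash is fixed, one possible client-side mitigation is to guard the call with a row-count check so the zero-row path never reaches select_ipc_gpu. This is only a sketch: `safe_select_ipc_gpu` is a hypothetical helper, and it assumes the target query can be wrapped in a `SELECT COUNT(*) FROM (...)` subquery, which may not hold for every HeavyDB query.

```python
def safe_select_ipc_gpu(conn, query):
    """Hypothetical guard around select_ipc_gpu.

    Runs a COUNT(*) first and only calls select_ipc_gpu when the query
    returns at least one row, sidestepping the zero-row SIGSEGV.
    Assumes the query is valid inside a subquery (may not hold for all
    HeavyDB statements).
    """
    cur = conn.execute(f"SELECT COUNT(*) FROM ({query}) AS t")
    (count,) = cur.fetchone()
    if count == 0:
        return None  # caller handles the empty result on the CPU side
    return conn.select_ipc_gpu(query)
```

The obvious trade-off is one extra round trip per query; for the workload described here that may be cheaper than a database outage.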

The reason we're using 6.4 is that we have some custom patches for our use cases.

Is this fixed in the latest version, 7.1? If so, we might migrate to it.

Thanks,

cdessanti commented 9 months ago

Hi,

Thanks for reporting the issue. I will try to reproduce it on our end. If I am successful, I will create an internal case for our engineering team to fix the problem.

Can you try running your application without GPU shared memory as a temporary workaround?
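The fallback suggested above could be wrapped so the GPU path is a switchable option. A minimal sketch, assuming the DB-API-style `execute`/`fetchall` cursor path that heavyai inherits from pymapd; `fetch_result` itself is a hypothetical helper, not part of the library:

```python
def fetch_result(conn, query, use_gpu=True):
    """Hypothetical wrapper: fetch results with or without GPU shared memory.

    With use_gpu=False the plain Thrift cursor path is used, which avoids
    the shared-memory IPC code entirely (at the cost of losing the
    zero-copy GPU DataFrame).
    """
    if use_gpu:
        return conn.select_ipc_gpu(query)  # cudf DataFrame in GPU memory
    cur = conn.execute(query)              # ordinary cursor, no shared memory
    return cur.fetchall()
```

Flipping `use_gpu` to False for the affected queries would let the application keep running while the crash is investigated.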

Also, I am interested in the modifications you made to the database to support your application. Could you share what they do?

Best regards, Candido

anirudh-here-com commented 9 months ago

Thanks for your reply. Unfortunately, GPU shared memory is required and cannot be dropped. Regarding the modifications, I plan to raise a pull request for them.

Please let me know if you're able to replicate the issue on your end. Thanks, Anirudh

cdessanti commented 9 months ago

Hi,

Using CUDA 11.8 and the latest GA release (7.2.1), I was able to reproduce the issue on my end. I have created an internal ticket to resolve it.

I'll come back here when the problem is fixed.