heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0

Error Running HeavyDB with Nvidia Nsight Compute: Broken Pipe in Thrift Connection #824

Closed shwetaiisc closed 9 months ago

shwetaiisc commented 9 months ago

I am encountering an issue when attempting to run Star Schema Benchmark (SSB) queries on HeavyDB while profiling with Nvidia's Nsight Compute (ncu). The queries run without issues when ncu is not involved, but running the HeavyDB server under ncu leads to a broken pipe error in the Thrift connection.

Environment:
- HeavyDB version: latest
- CUDA version: 12.1
- GPU driver version: 530.30.02
- Operating System: Ubuntu 20.04

Steps to Reproduce

1. Start the HeavyDB server under Nvidia Nsight Compute: ncu ./heavydb/build/bin/heavydb
   Note: I have also tried this command with sudo, to give ncu access to the hardware performance counters.

2. In a separate terminal, open the HeavyDB client: ./heavydb/build/bin/heavysql -p HyperInteractive

3. Run an example query (the tables are pre-populated):
   select sum(lo_extendedprice * lo_discount) as revenue from lineorder, ddate where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
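The same reproduction can be driven from Python instead of heavysql. The sketch below only builds the SSB Q1.1 query string from the steps above; the commented-out connection code follows the pymapd/heavyai Python client and its parameter names are assumptions, not something confirmed in this thread.

```python
# Sketch of the reproduction query. The connection snippet at the bottom is a
# hypothetical illustration based on the heavyai/pymapd Python client; it is
# commented out because it requires a running server.

def build_ssb_q1_1(year=1993, discount_lo=1, discount_hi=3, quantity_lt=25):
    """Build the SSB Q1.1 query string used in the reproduction steps."""
    return (
        "select sum(lo_extendedprice * lo_discount) as revenue "
        "from lineorder, ddate "
        "where lo_orderdate = d_datekey "
        f"and d_year = {year} "
        f"and lo_discount between {discount_lo} and {discount_hi} "
        f"and lo_quantity < {quantity_lt};"
    )

if __name__ == "__main__":
    print(build_ssb_q1_1())
    # Hypothetical client usage (assumed API, requires a running server):
    # import heavyai
    # con = heavyai.connect(user="admin", password="HyperInteractive",
    #                       host="localhost", port=6274, dbname="heavyai")
    # print(con.execute(build_ssb_q1_1()).fetchall())
```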

Expected Behavior
The query executes smoothly while the HeavyDB server runs under Nvidia Nsight Compute profiling.

Actual Behavior
When the HeavyDB server is launched with Nvidia Nsight Compute, the following error is encountered:

Thrift error: No more data to read.                                                                     
Thrift connection error: No more data to read.                                                          
Retrying connection                                                                                     
Thrift connection error: write() send(): Broken pipe
Retrying connection
Thrift: [date and time] TSocket::write_partial() send() <Host: localhost Port: 6274>: Broken pipe

Additional Information
The error seems to be related to the Thrift transport layer, specifically when the server is profiled with Nvidia Nsight Compute. This issue does not occur when the HeavyDB server runs without ncu profiling.
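The two Thrift messages in the log are the two classic client-side symptoms of a server process dying mid-session: a read on the dead connection returns EOF ("No more data to read"), while a later write raises EPIPE ("Broken pipe"). This is a generic socket-level illustration with no HeavyDB or Thrift code involved, shown on a local socket pair:

```python
import socket

# Minimal sketch (no HeavyDB/Thrift involved): when the server process dies,
# its side of the connection closes, and the client sees two distinct
# symptoms matching the log messages above.
srv, cli = socket.socketpair()
srv.close()  # simulate the server crashing mid-session

# 1) A read on the dead connection returns EOF,
#    which Thrift reports as "No more data to read."
assert cli.recv(4096) == b""

# 2) A write to the dead connection raises a broken pipe,
#    which Thrift reports as "write() send(): Broken pipe".
got_broken_pipe = False
try:
    for _ in range(8):  # the first send may be buffered; retry a few times
        cli.send(b"query")
except (BrokenPipeError, ConnectionResetError):
    got_broken_pipe = True
cli.close()
print("broken pipe observed:", got_broken_pipe)
```

So the Thrift errors are a downstream effect: the interesting failure is whatever makes the profiled server process exit.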

Request
Any insights or solutions to resolve this broken pipe error when profiling HeavyDB with Nvidia Nsight Compute would be greatly appreciated.

cdessanti commented 9 months ago

That's odd. I have been using ncu-ui extensively on dev builds recently without any issue.

The Thrift errors you are seeing appear because, after the database crashes, the heavysql Thrift calls raise this exception since the transport has been aborted.




shwetaiisc commented 9 months ago

Have you tried using NCU's command-line interface? I'm able to drop, create, and copy tables while the server is running under NCU, but I cannot run a SELECT query. I'm also able to generate Nsight Systems (nsys) reports when the heavydb server is executed under nsys.

I also noticed what you mentioned: after the "no more data to read" exception, the HeavyDB server crashes, and then the client reports a broken pipe.

cdessanti commented 9 months ago

Hi,

I am using ncu-ui, but the underlying ncu command is what profiles the database. I'm running exclusively queries, though focused on queries over plain tables without any kind of join.

I can't test with SSB right now, but I will as soon as possible and report back the results.


cdessanti commented 9 months ago

Hello,

I encountered a crash while using version 7.1 of the software; I was only able to profile the first three queries, possibly because I am using the 535 driver. With version 7.0, however, I can profile a few more queries, and the entire benchmark can be profiled using ncu.

shwetaiisc commented 9 months ago

So should I use version 7.0 for profiling SSB? Also, which kinds of queries could you not profile with NCU and HeavyDB?

cdessanti commented 9 months ago

I just profiled the full benchmark, but with fewer rows in the lineorder table (100M instead of 600M). I suspect my crashes are due to a lack of memory. Would you mind trying version 7.0? If you're using version 7.1, I suggest upgrading the driver to 535.

Unfortunately, I cannot provide you with more precise directions as I am still investigating the issue. Since the issue is related to the driver, it's not easy to figure out the exact problem.

cdessanti commented 9 months ago

Update: I had a system memory problem on my workstation. After freeing up some memory, I was able to successfully profile heavyai 7.1 with the 535 driver using 64GB of RAM. I could run almost all the queries of the SSB benchmark at an SF of 100. However, the last two queries cannot be profiled because the data is too large to fit in the global memory of the single RTX 2080 Ti in my workstation.

Profiling at SF 100 required over 32GB of system memory.
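A rough back-of-the-envelope supports the "too large for a single GPU" observation. SSB's lineorder table has 6M rows per scale factor, so SF 100 gives ~600M rows; the 4-bytes-per-column encoding below is an assumption for illustration, not a figure measured from HeavyDB:

```python
# Back-of-the-envelope sketch: why SSB at SF 100 can strain a single
# RTX 2080 Ti (11 GB). The row count per SF is standard for SSB; the
# 4-byte-per-column encoding is an assumption, not measured from HeavyDB.
SF = 100
lineorder_rows = 6_000_000 * SF  # ~600M rows at SF 100
cols_touched = 4                 # e.g. lo_extendedprice, lo_discount,
                                 # lo_orderdate, lo_quantity in Q1.1
bytes_per_col = 4                # assumed 4-byte encoding per column
gb = lineorder_rows * cols_touched * bytes_per_col / 2**30
print(f"~{gb:.1f} GB just for the touched lineorder columns")
```

Queries that touch more columns or join larger dimension tables push this higher still, which is consistent with only the last, heavier queries failing to fit.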

I used the 'ncu' tool with certain parameters to profile. Here is the command I used:

sudo /usr/local/NVIDIA-Nsight-Compute-2023.3/target/linux-desktop-glibc_2_11_3-x64/ncu --config-file off --export "/root/Documents/NVIDIA Nsight Compute/ssb_100_%i" --force-overwrite --kernel-name multifrag_query_hoisted_literals --metrics lts__average_gcomp_input_sector_success_rate.pct --set full --call-stack --nvtx --import-source yes --source-folder /opt/mapd_storage/github/master/heavydb-internal /opt/mapd/heavyai-ee-7.1.1-20230915-adbc472b74-Linux-x86_64-render/bin/heavydb --data /opt/mapd_storage/data48 --num-gpus=1

I have attached the profiling results that I obtained on my system.

Based on my analysis, I suggest upgrading the driver to version 535, as it should work better if you are using 7.1. I also recommend running the profiling with an SF suited to the GPU and system memory installed in your machine, because the profiler is likely to copy the GPU memory contents into system memory.

ssb_100_1.ncu-rep.log