heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0

NVRM :Xid (PCI:0000:81:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus #798

Closed jieguolove closed 1 year ago

jieguolove commented 1 year ago

Operating system: Ubuntu 20.04.5. If heavydb is not started, nvidia-smi looks normal and the temperature is around 65C. But a short while after starting heavydb, nvidia-smi shows the GPU temperature rising steadily, and when it reaches 97C the GPU card goes down. How can I solve this problem? Is this a heavydb bug?

The syslog is as follows:

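For anyone who wants to pull the same kind of event out of a text log instead of a screenshot, here is a minimal sketch that scans log lines for NVIDIA Xid messages like the one in the issue title. The regular expression is written against the single message quoted in the title, so treat the pattern, the `XidEvent` type, and the `find_xid_events` helper as illustrative assumptions rather than a complete parser for the driver's log format.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern modelled on the one message quoted in the issue title, e.g.
#   NVRM :Xid (PCI:0000:81:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus
# Whitespace around the first colon is tolerated because logs vary here.
XID_RE = re.compile(
    r"NVRM\s*:\s*Xid\s*\((?P<pci>PCI:[0-9A-Fa-f:.]+)\):\s*(?P<code>\d+),\s*(?P<detail>.*)"
)


class XidEvent(NamedTuple):
    """One Xid report: PCI address, numeric Xid code, and the trailing text."""
    pci: str
    code: int
    detail: str


def find_xid_events(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield an XidEvent for every line that contains an NVRM Xid message."""
    for line in lines:
        match = XID_RE.search(line)
        if match:
            yield XidEvent(match["pci"], int(match["code"]), match["detail"].strip())
```

On Ubuntu the lines would typically come from `/var/log/syslog`, for example `list(find_xid_events(open("/var/log/syslog", errors="replace")))`.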

cdessanti commented 1 year ago

Hello,

I'm running heavydb on Ubuntu 20.04.6 without any issues on a workstation with two RTX 2080 Ti cards, and we have many installations using T4, A10, and L4 cards. We have never observed the issue you're experiencing.

Additionally, when there is no activity, our software doesn't use the GPU; it doesn't even change the power state of the card. I've noticed that the temperature of your card is quite high when idle. On my workstation, under heavy GPU load, using a card with the same architecture but a higher TDP than yours, the temperature barely reaches 72°C at a room temperature of 27°C.

Which queries have you executed on the database? I can see that some memory has been allocated. Have you tried running other CUDA software on your system?

Maintaining a temperature of 67°C on an idle GPU with such a low TDP isn't ideal, and it's possible that you may encounter similar issues with other software utilizing CUDA.

Regards, Candido

jieguolove commented 1 year ago

With no other programs running, the temperature was 40C, but it began to rise after starting heavydb, even though there was no data in the database yet. It takes 2 minutes to get to 69C.
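A climb like this is easier to compare across runs (idle, heavydb started, other CUDA software) when it is logged as numbers rather than captured in screenshots. The following is a minimal sketch that polls `nvidia-smi` through its CSV query interface. The function names, the 5-second interval, and the choice of fields are assumptions made for illustration; nothing here comes from heavydb itself.

```python
import subprocess
import time
from typing import List, Optional

# nvidia-smi CSV query: one row per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw,pstate",
    "--format=csv,noheader,nounits",
]


def parse_readings(csv_text: str) -> List[dict]:
    """Turn nvidia-smi CSV rows into dicts; unreadable temperatures become None."""
    readings = []
    for row in csv_text.strip().splitlines():
        index, temp, power, pstate = [field.strip() for field in row.split(",")]
        temp_c: Optional[int] = int(temp) if temp.isdigit() else None
        # power.draw can be reported as "[N/A]", so it is kept as a string.
        readings.append({"gpu": int(index), "temp_c": temp_c, "power_w": power, "pstate": pstate})
    return readings


def sample() -> List[dict]:
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_readings(out)


def watch(interval_s: float = 5.0, samples: int = 24) -> None:
    """Print timestamped readings; the defaults cover roughly two minutes."""
    for _ in range(samples):
        for reading in sample():
            print(time.strftime("%H:%M:%S"), reading)
        time.sleep(interval_s)
```

Running `watch()` once before starting heavydb and once after gives two series that can be pasted into the issue as text. The power state column is included because an idle card that never leaves a high-performance state would be a useful clue.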

jieguolove commented 1 year ago

it's down!!!

In the past, inference programs ran normally on this machine, with no temperature change like this.

It's so incredible.

cdessanti commented 1 year ago

Hi,

Yes, this behavior is strange.

Could you share the version of heavydb you are using, and try starting the server with the options cpu-only=true and rendering=false?

Candido

jieguolove commented 1 year ago

> Hi,
>
> yes, this behavior is strange.
>
> Could you share the version of heavydb you are using, and try starting the server with these options cpu-only=true rendering=false
>
> Candido

download url: https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64.tar.gz

jieguolove commented 1 year ago

If I add these options, heavydb cannot start. Why?

cpu-only is OK.

rendering fails, but the rendering option is listed in the documentation: https://docs.heavy.ai/installation-and-configuration/config-parameters/configuration-parameters-for-heavydb

Is the rendering option wrong?

    heavyai@node13:/var/lib/heavyai$ more heavy.conf
    port = 6274
    http-port = 6278
    calcite-port = 6279
    data = "/var/lib/heavyai/storage"
    null-div-by-zero = true
    cpu-only=true #cpu-only is ok
    #rendering = false is failed???

    [web]
    port = 6273
    frontend = "/opt/heavyai/frontend"

Since the GPU is not used, the temperature does not rise.

But why does it misbehave when the GPU is enabled? This still doesn't solve the GPU problem.

cdessanti commented 1 year ago

I don't know.

Which error is the server returning?


jieguolove commented 1 year ago

`
root@node13:~# vi /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
rendering=false
~
"/var/lib/heavyai/heavy.conf" 7L, 137C written
root@node13:~# cat /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
rendering=false
root@node13:~# systemctl stop heavydb
root@node13:~# systemctl status heavydb
● heavydb.service - HEAVY.AI HeavyDB database server
     Loaded: loaded (/lib/systemd/system/heavydb.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Mon 2023-09-11 09:55:22 CST; 4s ago
    Process: 5646 ExecStart=/opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf (code=exited, status=0/SUCCESS)
   Main PID: 5646 (code=exited, status=0/SUCCESS)

Sep 11 09:33:44 node13 heavydb[5646]: "." ...
Sep 11 09:33:44 node13 heavydb[5646]:
Sep 11 09:44:19 node13 heavydb[5646]: 2023-09-11T09:44:19.753792 E 5646 121 84069 DBHandler.cpp:1331 File or directory "/var/lib/heavyai/storage/import/sample_datasets/PIDTYPE_TABLE.csv" does not exist.
Sep 11 09:46:23 node13 heavydb[5646]: 2023-09-11T09:46:23.264968 E 5646 139 1 DBHandler.cpp:1331 Table/View temp_jt_report_ipv6_tab for catalog hblt does not exist
Sep 11 09:46:32 node13 heavydb[5646]: 2023-09-11T09:46:32.444595 E 5646 143 1 DBHandler.cpp:1331 Table temp_jt_report_ipv6_tab already exists and no data was loaded.
Sep 11 09:46:32 node13 heavydb[5646]: 2023-09-11T09:46:32.450062 E 5646 144 1 DBHandler.cpp:1331 Table temp_jt_report_ipv6_tab already exists and no data was loaded.
Sep 11 09:48:02 node13 heavydb[5646]: 2023-09-11T09:48:02.409741 E 5646 159 84069 DBHandler.cpp:1331 File or directory "/var/lib/heavyai/storage/import/sample_datasets/adsl_table.csv" does not exist.
Sep 11 09:55:20 node13 systemd[1]: Stopping HEAVY.AI HeavyDB database server...
Sep 11 09:55:22 node13 systemd[1]: heavydb.service: Succeeded.
Sep 11 09:55:22 node13 systemd[1]: Stopped HEAVY.AI HeavyDB database server.
root@node13:~# systemctl start heavydb
root@node13:~# systemctl status heavydb
● heavydb.service - HEAVY.AI HeavyDB database server
     Loaded: loaded (/lib/systemd/system/heavydb.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2023-09-11 09:55:32 CST; 484ms ago
    Process: 313649 ExecStart=/opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf (code=exited, status=1/FAILURE)
   Main PID: 313649 (code=exited, status=1/FAILURE)

Sep 11 09:55:32 node13 systemd[1]: heavydb.service: Scheduled restart job, restart counter is at 5.
Sep 11 09:55:32 node13 systemd[1]: Stopped HEAVY.AI HeavyDB database server.
Sep 11 09:55:32 node13 systemd[1]: heavydb.service: Start request repeated too quickly.
Sep 11 09:55:32 node13 systemd[1]: heavydb.service: Failed with result 'exit-code'.
Sep 11 09:55:32 node13 systemd[1]: Failed to start HEAVY.AI HeavyDB database server.
root@node13:~# vi /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
#rendering=false
~
~
"/var/lib/heavyai/heavy.conf" 7L, 138C written
root@node13:~# cat /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
#rendering=false
root@node13:~# systemctl start heavydb
root@node13:~# systemctl status heavydb
● heavydb.service - HEAVY.AI HeavyDB database server
     Loaded: loaded (/lib/systemd/system/heavydb.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-09-11 09:56:47 CST; 2s ago
   Main PID: 313659 (heavydb)
      Tasks: 40 (limit: 149999)
     Memory: 111.6M
     CGroup: /system.slice/heavydb.service
             ├─313659 /opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf
             └─313667 -Xmx1024m -DLOG_DIR=/var/lib/heavyai/storage/log/ -jar /opt/heavyai/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar -e /opt/heavyai/QueryEngine/ -d /var/lib/heavyai/storage -p >

Sep 11 09:56:47 node13 systemd[1]: Started HEAVY.AI HeavyDB database server.
root@node13:~#
`

cdessanti commented 1 year ago

Hi,

To get the exact error that's preventing the server from starting correctly, please share the heavydb.INFO.[timestamp].log file for that server startup. You can find it in the /var/lib/heavyai/storage/log directory on your server.
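As a rough illustration of that step, the sketch below picks the most recently modified `heavydb.INFO.*.log` file in a log directory and returns only its error-level lines. The helper names are hypothetical. The severity filter assumes the layout visible in the journal output in this thread, where the second whitespace-separated field is a severity letter such as `E`; treating `F` as "fatal" is an assumption, so adjust the tuple if the log uses other markers.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def latest_info_log(log_dir: str) -> Optional[Path]:
    """Return the most recently modified heavydb.INFO.*.log file, or None if there is none."""
    candidates = sorted(
        Path(log_dir).glob("heavydb.INFO.*.log"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def error_lines(path: Path, severities: Tuple[str, ...] = ("E", "F")) -> List[str]:
    """Collect lines whose second field is one of the given severity letters."""
    hits = []
    with open(path, errors="replace") as fh:
        for line in fh:
            fields = line.split(maxsplit=2)
            if len(fields) >= 2 and fields[1] in severities:
                hits.append(line.rstrip("\n"))
    return hits
```

For the installation in this thread the call would be `latest_info_log("/var/lib/heavyai/storage/log")`, followed by `error_lines(...)` on the result if it is not `None`. Sharing the whole file is still preferable when the error lines alone do not explain the failed start.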

jieguolove commented 1 year ago

After switching to a machine with an A100, heavydb runs normally, and the GPU temperature no longer keeps rising; it stays at around 60C. There may be a hardware problem with the T4 machine. Thank you very much for your patient answers!

cdessanti commented 1 year ago

Never mind. The A100 will perform better than the T4; it has a somewhat newer architecture and a higher number of CUDA cores, so you are definitely using better hardware.