Closed jieguolove closed 1 year ago
Hello,
I'm running heavydb on Ubuntu 20.04.6 without any issues on a workstation with two RTX 2080 Ti cards, and we have many installations using T4, A10, and L4 cards. We have never observed the issue you're experiencing.
Additionally, when there is no activity, our software doesn't use the GPU; it doesn't even change the power state of the card. I've noticed that the temperature of your card is quite high when idle. On my workstation, under heavy GPU load, using a card with the same architecture but a higher TDP than yours, the temperature barely reaches 72°C, with a room temperature of 27°C.
Which queries have you executed on the database? I can see that some memory has been allocated. Have you tried running other CUDA software on your system?
Maintaining a temperature of 67°C on an idle GPU with such a low TDP isn't ideal, and it's possible that you may encounter similar issues with other software utilizing CUDA.
Regards, Candido
With no other programs running, the temperature was 40°C, but it began to rise after starting heavydb, even though there was no data in the database yet. It took 2 minutes to reach 69°C.
Then the GPU goes down!
In the past, inference programs ran normally on this machine, and there was never a temperature change like this.
It's so incredible.
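One way to document this behaviour is to log temperature, power draw, and performance state while heavydb starts, and compare that with a run where heavydb is stopped. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names (`parse_gpu_csv`, `read_gpus`, `watch`) and the 5-second interval are illustrative choices and not part of HeavyDB.

```python
import subprocess
import time

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "pstate", "utilization.gpu"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the requested fields
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


def watch(interval_s=5, samples=24):
    """Print a timestamped reading every `interval_s` seconds."""
    for _ in range(samples):
        stamp = time.strftime("%H:%M:%S")
        for gpu in read_gpus():
            print(stamp, gpu)
        time.sleep(interval_s)
```

Calling `watch()` once with heavydb stopped and once right after `systemctl start heavydb` gives two logs to compare. A `pstate` that changes or a `power.draw` that climbs while `utilization.gpu` stays at 0 would be useful detail to attach to the issue.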
Hi,
Yes, this behavior is strange.
Could you share the version of heavydb you are using, and try starting the server with these options: cpu-only=true and rendering=false?
Candido
Download URL: https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64.tar.gz
If I add these options, heavydb cannot start. Why?
Is it the rendering option that fails? It is listed in the documentation: https://docs.heavy.ai/installation-and-configuration/config-parameters/configuration-parameters-for-heavydb
Is the rendering option wrong?
heavyai@node13:/var/lib/heavyai$ more heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true #cpu-only is ok
[web]
port = 6273
frontend = "/opt/heavyai/frontend"
With cpu-only enabled the GPU is not used, so the temperature does not rise.
But why does it misbehave when the GPU is enabled? This still doesn't solve the GPU problem.
I don't know.
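To confirm that `rendering = false` really is the only difference between the config that starts and the one that does not, the two files can be compared key by key. This is a rough sketch; `parse_conf` and `conf_diff` are hypothetical helpers written for this thread, and the parser only handles the simple `key = value` and `[section]` layout shown above (it would mis-handle a `#` inside a quoted value).

```python
def parse_conf(text):
    """Parse heavy.conf-style text into {(section, key): value}."""
    settings = {}
    section = ""  # keys before any [section] header get an empty section name
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[(section, key.strip())] = value.strip().strip('"')
    return settings


def conf_diff(old_text, new_text):
    """Return {(section, key): (old_value, new_value)} for every changed key."""
    old, new = parse_conf(old_text), parse_conf(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }
```

Reading `heavy.conf` and the saved failing copy into two strings and passing them to `conf_diff` lists only the keys that changed, which rules out an accidental edit elsewhere in the file before blaming a single option.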
Which error is the server returning?
From: jieguolove @.>
Sent: Sunday, September 10, 2023 11:13:16 AM
To: heavyai/heavydb @.>
Cc: Candido Dessanti @.>; Comment @.>
Subject: Re: [heavyai/heavydb] NVRM :Xid (PCI:0000:81:00): 79, pid='
if i add these options,the heavydb cannot start!!! why?
heavyai@node13:/var/lib/heavyai$ more heavy.confbak
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only = true
rendering = false
[web]
port = 6273
frontend = "/opt/heavyai/frontend"
`
root@node13:~# vi /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
rendering=false
~
"/var/lib/heavyai/heavy.conf" 7L, 137C written
root@node13:~# cat /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
rendering=false
root@node13:~# systemctl stop heavydb
root@node13:~# systemctl status heavydb
● heavydb.service - HEAVY.AI HeavyDB database server
Loaded: loaded (/lib/systemd/system/heavydb.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2023-09-11 09:55:22 CST; 4s ago
Process: 5646 ExecStart=/opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf (code=exited, status=0/SUCCESS)
Main PID: 5646 (code=exited, status=0/SUCCESS)
9月 11 09:33:44 node13 heavydb[5646]: "." ...
9月 11 09:33:44 node13 heavydb[5646]:
9月 11 09:44:19 node13 heavydb[5646]: 2023-09-11T09:44:19.753792 E 5646 121 84069 DBHandler.cpp:1331 File or directory "/var/lib/heavyai/storage/import/sample_datasets/PIDTYPE_TABLE.csv" does not exist.
9月 11 09:46:23 node13 heavydb[5646]: 2023-09-11T09:46:23.264968 E 5646 139 1 DBHandler.cpp:1331 Table/View temp_jt_report_ipv6_tab for catalog hblt does not exist
9月 11 09:46:32 node13 heavydb[5646]: 2023-09-11T09:46:32.444595 E 5646 143 1 DBHandler.cpp:1331 Table temp_jt_report_ipv6_tab already exists and no data was loaded.
9月 11 09:46:32 node13 heavydb[5646]: 2023-09-11T09:46:32.450062 E 5646 144 1 DBHandler.cpp:1331 Table temp_jt_report_ipv6_tab already exists and no data was loaded.
9月 11 09:48:02 node13 heavydb[5646]: 2023-09-11T09:48:02.409741 E 5646 159 84069 DBHandler.cpp:1331 File or directory "/var/lib/heavyai/storage/import/sample_datasets/adsl_table.csv" does not exist.
9月 11 09:55:20 node13 systemd[1]: Stopping HEAVY.AI HeavyDB database server...
9月 11 09:55:22 node13 systemd[1]: heavydb.service: Succeeded.
9月 11 09:55:22 node13 systemd[1]: Stopped HEAVY.AI HeavyDB database server.
root@node13:~# systemctl start heavydb
root@node13:~# systemctl status heavydb
● heavydb.service - HEAVY.AI HeavyDB database server
Loaded: loaded (/lib/systemd/system/heavydb.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2023-09-11 09:55:32 CST; 484ms ago
Process: 313649 ExecStart=/opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf (code=exited, status=1/FAILURE)
Main PID: 313649 (code=exited, status=1/FAILURE)
9月 11 09:55:32 node13 systemd[1]: heavydb.service: Scheduled restart job, restart counter is at 5.
9月 11 09:55:32 node13 systemd[1]: Stopped HEAVY.AI HeavyDB database server.
9月 11 09:55:32 node13 systemd[1]: heavydb.service: Start request repeated too quickly.
9月 11 09:55:32 node13 systemd[1]: heavydb.service: Failed with result 'exit-code'.
9月 11 09:55:32 node13 systemd[1]: Failed to start HEAVY.AI HeavyDB database server.
root@node13:~# vi /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
~
~
"/var/lib/heavyai/heavy.conf" 7L, 138C written
root@node13:~# cat /var/lib/heavyai/heavy.conf
port = 6274
http-port = 6278
calcite-port = 6279
data = "/var/lib/heavyai/storage"
null-div-by-zero = true
cpu-only=true
root@node13:~# systemctl start heavydb
root@node13:~# systemctl status heavydb
● heavydb.service - HEAVY.AI HeavyDB database server
Loaded: loaded (/lib/systemd/system/heavydb.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2023-09-11 09:56:47 CST; 2s ago
Main PID: 313659 (heavydb)
Tasks: 40 (limit: 149999)
Memory: 111.6M
CGroup: /system.slice/heavydb.service
├─313659 /opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf
└─313667 -Xmx1024m -DLOG_DIR=/var/lib/heavyai/storage/log/ -jar /opt/heavyai/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar -e /opt/heavyai/QueryEngine/ -d /var/lib/heavyai/storage -p >
9月 11 09:56:47 node13 systemd[1]: Started HEAVY.AI HeavyDB database server.
root@node13:~# `
Hi,
To get the exact error that's preventing the server from starting correctly, please share the heavydb.INFO.[timestamp].log file that covers the server startup; you can find it in the /var/lib/heavyai/storage/log directory on your server.
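If the log is long, the relevant lines can be pulled out before posting. This is a small sketch under a few assumptions: the log directory and file-name prefix are the ones mentioned above, the severity is taken to be the second whitespace-separated field (as in the `... E 5646 ...` lines from the journal output earlier in this thread), and treating `F` as a fatal level is an assumption. `latest_info_log` and `severe_lines` are illustrative names.

```python
from pathlib import Path

LOG_DIR = Path("/var/lib/heavyai/storage/log")  # directory named in the thread


def latest_info_log(log_dir=LOG_DIR):
    """Return the most recently modified heavydb.INFO.* file, or None."""
    candidates = [p for p in Path(log_dir).glob("heavydb.INFO.*") if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def severe_lines(lines, levels=("E", "F")):
    """Keep lines whose second field is one of the given severity letters."""
    hits = []
    for line in lines:
        fields = line.split(maxsplit=2)
        if len(fields) >= 2 and fields[1] in levels:
            hits.append(line.rstrip("\n"))
    return hits
```

For example, `severe_lines(latest_info_log().read_text(errors="replace").splitlines())` returns only the error-level entries from the newest log, provided a log file exists. The last few of those after a failed `systemctl start heavydb` are the ones worth sharing.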
After switching to a machine with an A100, heavydb runs normally, and the GPU temperature no longer keeps rising; it stays at around 60°C. There may be a hardware problem with the machine that has the T4 card. Thank you very much for your patient answers!
Never mind. The A100 will be more performant than the T4; it has a newer architecture and more CUDA cores, so you are certainly using better hardware.
Operating system: Ubuntu 20.04.5. Before heavydb is started, nvidia-smi looks normal and the temperature is around 65°C. A while after starting heavydb, nvidia-smi shows the GPU temperature rising continuously, and when it reaches 97°C the GPU card goes down. How can this be solved? Is this a heavydb bug?
The syslog is as follows: