QUDA GPU Meltdown "bug"

azrael417 commented 9 years ago

Hey,

sorry for the pathetic naming, but cool "bugs" have to have cool names :). I run QUDA on JUDGE at FZ Jülich in Germany. They have Tesla M2050 as well as M2070 GPUs, and they are passively cooled (yes, that exists). When I run dozens of inversions on a card, I get double-bit errors (DBEs), meaning that ECC cannot correct the flipped bits in memory. Shortly after these messages, the GPUs die due to overheating. It crashes in the QUDA part of the code, and I guess the reason for that is QUDA's efficiency. So it is not strictly a "bug" but an inconvenient feature. Would it be possible to add some heat control as an optional feature, such that the code halts for a couple of cycles if the GPU gets too hot? This sounds crazy, but on the JUDGE machine the problem is serious: I have already burnt >50 GPUs within 3 months.
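A minimal sketch of what such an optional heat control might look like, assuming the application can call a hook between inversions. Nothing here is part of QUDA: the function name `wait_until_cool`, the 85 C and 75 C thresholds, and the polling interval are all hypothetical, and the temperature source is passed in as a callable so the logic does not depend on any particular GPU library.

```python
import time
from typing import Callable


def wait_until_cool(read_temp_c: Callable[[], float],
                    max_temp_c: float = 85.0,      # hypothetical pause threshold
                    resume_temp_c: float = 75.0,   # hypothetical resume threshold
                    poll_seconds: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Block while the GPU is too hot; return how many polling pauses were taken.

    Uses hysteresis: once the temperature reaches max_temp_c, work stays
    paused until the temperature has dropped to resume_temp_c or below.
    """
    if read_temp_c() < max_temp_c:
        return 0  # cool enough, continue immediately

    pauses = 0
    while True:
        sleep(poll_seconds)
        pauses += 1
        if read_temp_c() <= resume_temp_c:
            return pauses
```

The caller would invoke `wait_until_cool(...)` before launching each inversion. One way to supply `read_temp_c` would be the `pynvml` bindings (`nvmlDeviceGetTemperature` with `NVML_TEMPERATURE_GPU`), assuming the driver exposes the sensor on these cards.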

Shall I compile with host- and device-debug first and see what really happens and then send the output to someone?

Best Thorsten

maddyscientist commented 9 years ago

Thorsten, this is not a QUDA issue. If the GPUs are overheating, then it sounds like there are serious issues with the JUDGE system, e.g., airflow problems. I would suggest you contact the JUDGE system administrators and explain your problem.

I note that almost all GPUs in a server environment are passively cooled, since an external fan provides much better cooling than the onboard fans of actively cooled cards. You can't build a GPU cluster without using passive cooling (unless you are willing to use liquid cooling).

GPUs should automatically have this "heat control" built in: if they get too hot, they down-clock to bring the temperature under control. What temperature are you seeing? A typical passively cooled GPU system should see temperatures of around 60C. The maximum reliable temperature is about 90C.
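To answer the temperature question, one option is to poll `nvidia-smi` and compare the readings against the two figures above. The following is only a sketch: it assumes `nvidia-smi` is on the path and that the installed driver supports the `--query-gpu` interface for these cards. The helper names and the "normal" / "warm" / "too hot" labels are hypothetical.

```python
import subprocess
from typing import Dict

# Query GPU index and core temperature as bare CSV, one GPU per line.
QUERY = ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_temperatures(csv_text: str) -> Dict[int, int]:
    """Parse lines such as '0, 62' into {gpu_index: temperature_in_C}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)  # raises ValueError if the sensor reports N/A
    return temps


def classify(temp_c: int, typical_c: int = 60, max_reliable_c: int = 90) -> str:
    """Label a reading relative to the ~60C typical and ~90C maximum figures."""
    if temp_c >= max_reliable_c:
        return "too hot"
    if temp_c > typical_c:
        return "warm"
    return "normal"


def read_temperatures() -> Dict[int, int]:
    """Run nvidia-smi once and return the parsed temperatures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)
```

Logging `read_temperatures()` every few seconds during a run would show whether the cards approach 90C before the double-bit errors appear.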

Moreover, QUDA actually consumes only a fraction of the TDP of a GPU: because it is memory-bandwidth bound, you typically get only 10-20% of peak throughput. This means that on a GPU with a TDP of 235 watts, QUDA will consume only 100-150 watts at most.
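As a quick check of the arithmetic, the quoted power figures can be expressed as a share of the 235 watt TDP. The helper name `tdp_fraction` is just for illustration.

```python
def tdp_fraction(power_watts: float, tdp_watts: float) -> float:
    """Return power draw as a fraction of the card's TDP."""
    return power_watts / tdp_watts


low = tdp_fraction(100, 235)   # lower end of the quoted power range
high = tdp_fraction(150, 235)  # upper end of the quoted power range
print(f"{low:.0%} to {high:.0%} of TDP")  # prints "43% to 64% of TDP"
```

So even at the upper end of the quoted range, the card would be running at roughly two thirds of its rated power.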

azrael417 commented 9 years ago

Hi Mike,

the problem only occurs when I use QUDA. I asked the admins for some performance and power consumption data. The GPUs are old, but they still should not die, I agree. However, they said that they replace every single one that fails, so if I just continue running, I basically get a new cluster to run on.

Best Thorsten

In reply to the comment by mikeaclark of 27.10.2014, 12:55: https://github.com/lattice/quda/issues/166#issuecomment-60658432

azrael417 commented 9 years ago

And what is very strange is that the M2050s die almost immediately, after an hour or so, while the M2070s can run for 12 hours with (almost) no issues. This morning I got the first DBE on one of two M2070 GPUs.

maddyscientist commented 9 years ago

I have an NVIDIA colleague who is based at Jülich. Perhaps I should put you two in direct contact as he will be better placed to help resolve what's going on.

maddyscientist commented 9 years ago

Closing, since this is not a QUDA bug.

lattice / quda

QUDA GPU Meltdown "bug" #166