madMAx43v3r / chia-gigahorse


Harvester on Linux sometimes takes ~5 GiB of memory #250

Closed: sobertram closed this issue 9 months ago

sobertram commented 9 months ago

Using latest:

~/chia-gigahorse-farmer$ ./chia.bin version
2.1.1.giga25

This would have gone unnoticed, but I am also plotting on this machine, and the plotter crashed with the following message:

terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to allocate 536870912 bytes of MEM_TYPE_DEVICE: CUDA error 2: out of memory

I ran nvidia-smi right after the crash and noticed the harvester's high memory use:

Thu Dec 14 11:01:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:01:00.0 Off |                    0 |
| N/A   80C    P0              26W /  75W |   5712MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    824716      C   chia_harvester                             5708MiB |
+---------------------------------------------------------------------------------------+

I restarted the harvester and that cleared it up. Typical memory use for harvesting is ~520 MiB. Here is the memory usage when both the plotter and the harvester are running:

Thu Dec 14 11:14:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:01:00.0 Off |                    0 |
| N/A   90C    P0              35W /  75W |   6402MiB /  7680MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    839768      C   chia_harvester                              520MiB |
|    0   N/A  N/A    839941      C   ...chia-gigahorse-farmer/cuda_plot_k32     5672MiB |
+---------------------------------------------------------------------------------------+

Running on kernel 5.4.0-167-generic #184-Ubuntu SMP Tue Oct 31 09:21:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux.

I am just switching my system over to Gigahorse and got through 11 plots; this failed in phase 2 of the 12th plot. The system had been running for about 2.5 to 3 hours before the issue.

Plotting info:

Chia k32 next-gen CUDA plotter - 6ec48cb
Plot Format: mmx-v2.5
Network Port: 8444 [chia]
No. GPUs: 1
No. Streams: 3
Direct IO: No
Final Destination: /farmtmp/
Bucket Chunk Size: 8 MiB
Max Pinned Memory: 480 GiB
Number of Plots: 43
Initialization took 0.104 sec
Crafting plot 1 out of 43 (2023/12/14 08:19:58)
Process ID: 829823
Pool Puzzle Hash:  xxx
Farmer Public Key: xxx
Working Directory:   /nvplots/
Working Directory 2: /nvplots/
Compression Level: C20

The last thing in the log before the crash:

[P2] Setup took 0.129 sec
[P2] Table 7 took 19.551 sec, 1.63467 GB/s up, 0.0259753 GB/s down
[P2] Table 6 took 19.042 sec, 1.67943 GB/s up, 0.0266696 GB/s down
Phase 2 took 38.82 sec
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to allocate 536870912 bytes of MEM_TYPE_DEVICE: CUDA error 2: out of memory
sobertram commented 9 months ago

This just happened again after I opened this ticket. I also want to add that it is not always around 5 GiB; it simply takes up enough that the plotter cannot allocate the memory it needs. In the past it would sit steadily at ~500 MiB. This time it took ~3 GiB, and it happened much faster than before. I will see if I can find anything else in the logs that gives a clue to what triggers it.

Thu Dec 14 11:32:05 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:01:00.0 Off |                    0 |
| N/A   82C    P0              26W /  75W |   3836MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    839768      C   chia_harvester                             3832MiB |
+---------------------------------------------------------------------------------------+
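To help correlate a spike like this with harvester activity, here is a minimal monitoring sketch (not from the original report) that logs per-process GPU memory with nvidia-smi; the log path and 10-second interval are arbitrary assumptions.

# Append timestamped per-process GPU memory usage every 10 seconds,
# so a VRAM spike can be matched against entries in the harvester log.
while true; do
    echo "=== $(date -u '+%Y-%m-%d %H:%M:%S') ===" >> ~/gpu-mem.log
    nvidia-smi --query-compute-apps=pid,process_name,used_memory \
               --format=csv,noheader >> ~/gpu-mem.log
    sleep 10
done

Grepping that log for chia_harvester then shows when usage jumps from the ~520 MiB baseline.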
duncandubick commented 9 months ago

Farming C20 plots takes that much VRAM. It only shoots up that high when the harvester encounters a C20 plot, and since you say you just started replotting with GH, you probably only have a few such plots so far.

https://github.com/madMAx43v3r/chia-gigahorse?tab=readme-ov-file#ram--vram-requirements-to-farm

You may want to reconsider C20 if you're only farming with a P4.

sobertram commented 9 months ago

Thanks. So as long as I am not plotting in parallel, it should be able to manage. I will take this into consideration as I press forward.

sobertram commented 9 months ago

As @duncandubick pointed out, this behavior is consistent with the C20 VRAM requirements and makes sense. Closing.

In case others run into this: as a fix, I moved harvesting duties from the plotter to another harvester, following the instructions here: https://github.com/madMAx43v3r/chia-gigahorse?tab=readme-ov-file#remote-compute. Another fix could be to add another GPU to your plotter.
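For the second-GPU option, a minimal sketch of keeping the two workloads apart, assuming two GPUs (indices 0 and 1) and using the standard CUDA_VISIBLE_DEVICES environment variable; the start command and plotting arguments below are placeholders, not taken from this thread.

# Assumed layout: GPU 0 handles farming/harvesting, GPU 1 handles plotting.
# CUDA_VISIBLE_DEVICES restricts which devices a CUDA process can see.

# Start the farmer (and its harvester) on GPU 0 only:
CUDA_VISIBLE_DEVICES=0 ./chia.bin start farmer

# Run the plotter on GPU 1 only (substitute your usual plotting arguments):
CUDA_VISIBLE_DEVICES=1 ./cuda_plot_k32 <your usual plotting arguments>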