madMAx43v3r / chia-gigahorse


Plotting stuck on copy multiple destinations #142

Open lookasiamrose opened 1 year ago

lookasiamrose commented 1 year ago

Hello, I'm stuck and out of ideas on a cuda_plotter_k32 issue. Is it expected behavior that, when copying to multiple destinations holds up plotting (the plotter waits for the temporary directory to empty while the copies to all destinations are still running), we get this: [image] Basically, it never resumes. The copy is done (the temp directory is empty), as the attached log shows, but it has stayed like this for hours :) It gets stuck on "busy, waiting...", then after some minutes the copy commands finish one by one, and that's it: plotting never resumes. On my setup this is reproducible every time. The machine is an HP DL580 G8 server with Smart Array P400 and P200 controllers in HBA mode.

Thanks for the help, I appreciate your effort!

lookasiamrose commented 1 year ago

We fixed the issue above (the plotter randomly got stuck at a random table: the copy processes finished and nothing continued). You wouldn't believe it, but uninstalling the xorg server and light-dm packages, with a complete purge, fixed the issue :)

We had them installed in an Ubuntu Server environment because we also wanted to overclock the GPUs on that system. So we did all the required steps (you can find the advice on the web; it's basically installing the packages mentioned above, setting variables, and turning on the "bits"). Then you can change the memory or VRAM clocks, fans, and so on with the NVIDIA CLI commands.

So I have no idea what the issue was, but without those packages the plotter no longer gets stuck randomly; it has been working properly for days now. My only hint is that while the problem was present and the plotter was stuck, one thread on this multicore server sat at 100% usage. Sometimes, after issuing the start command, it froze for about 3 minutes before the plotter process started logging anything. Also, a table would sometimes stall for a while: a step normally takes 50-90 s, but randomly took 400 s...
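In case someone wants to check for the same "one thread at 100%" symptom, here is a rough sketch of how it could be spotted on Linux. It is not part of the plotter and not what we actually ran; it only assumes the standard `/proc/<pid>/task/<tid>/stat` layout (utime and stime are fields 14 and 15), and the function names are made up for illustration.

```python
import os
import sys
import time


def parse_thread_stat(line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = line.index("(")
    rpar = line.rindex(")")
    name = line[lpar + 1:rpar]
    rest = line[rpar + 2:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return name, int(rest[11]) + int(rest[12])


def snapshot(pid):
    """Map thread id -> (name, cpu_ticks) for every thread of the process."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                result[int(tid)] = parse_thread_stat(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listing and reading
    return result


def busiest_threads(pid, interval=5.0):
    """Return [(cpu_percent, tid, name), ...] sorted from busiest to idlest."""
    ticks_per_sec = os.sysconf("SC_CLK_TCK")
    before = snapshot(pid)
    time.sleep(interval)
    after = snapshot(pid)
    usage = []
    for tid, (name, ticks) in after.items():
        if tid in before:
            pct = 100.0 * (ticks - before[tid][1]) / ticks_per_sec / interval
            usage.append((pct, tid, name))
    return sorted(usage, reverse=True)


if __name__ == "__main__":
    # Usage (hypothetical file name): python3 busy_threads.py <plotter pid>
    for pct, tid, name in busiest_threads(int(sys.argv[1]))[:5]:
        print(f"{pct:6.1f}%  tid={tid}  {name}")
```

If one thread stays near 100% while the log shows no progress, that would match what we saw; it only tells you which thread is spinning, not why.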

I have no idea what that was. Temperatures on the GPU and everywhere else were OK, and we checked everything. Maybe this will help someone ^ Best, Lj