Clean up crashed plotting temp files after crash

tajition commented 3 years ago

Every now and then a plotting process crashes and the plotter starts the next job, but the temporary files are left on the temp drive. This can result in the drive filling up, leaving the plotter processes unable to write. When this happens I have to manually find the crashed plot's temp files and delete them.

I am not familiar with Python programming (I am mostly a Node.js + Angular dev), but here is how I think it could be achieved: get the process exit code to determine whether the plotting process finished successfully. If the exit code is 0, move on to the next job; if not, get the plot ID from the log file. This would require keeping track of PIDs and associating them with log files somewhere. Then search for all the files with that plot name and delete them before starting the next job. This approach is probably problematic if the manager process itself is restarted, but it could be a fallback that helps in certain cases.
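The exit-code idea above could look roughly like the following. This is only a sketch, not the manager's actual code: `run_job`, `find_plot_id`, and `remove_temp_files` are hypothetical names, and it assumes that the plotter log contains a line of the form `ID: <hex digits>` and that every temp file name contains that plot ID.

```python
import re
import subprocess
from pathlib import Path

# Assumption: the plotter log contains a line such as "ID: <hex digits>".
PLOT_ID_RE = re.compile(r"^ID: ([0-9a-f]+)", re.MULTILINE)


def find_plot_id(log_text):
    """Return the plot ID found in the log text, or None if there is none."""
    match = PLOT_ID_RE.search(log_text)
    return match.group(1) if match else None


def remove_temp_files(temp_dir, plot_id):
    """Delete every file in temp_dir whose name contains plot_id."""
    removed = []
    for path in Path(temp_dir).iterdir():
        if path.is_file() and plot_id in path.name:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


def run_job(command, log_path, temp_dir):
    """Run one plotting job; on a non-zero exit code, clean up its temp files."""
    with open(log_path, "w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        plot_id = find_plot_id(Path(log_path).read_text())
        if plot_id:
            remove_temp_files(temp_dir, plot_id)
    return result.returncode
```

Because the cleanup only runs when the manager itself sees the exit code, it does not help after a manager restart, which is the limitation mentioned above.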

Another approach I can think of is to keep track of each job's PID and plot ID. Before starting a new job, check the last N log files (where N is the number of jobs) and make sure each process finished. If one did not, check whether that process is actually still running. If its PID is not present, delete its temp files.
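A minimal sketch of this PID-based check follows. The names are hypothetical: `tracked_jobs` stands for whatever PID-to-plot-ID bookkeeping the manager would keep for unfinished jobs, and `live_pids` is the set of PIDs currently running (it could be obtained with something like `psutil.pids()`). It also assumes temp file names contain the plot ID, and it ignores PID reuse by unrelated processes.

```python
from pathlib import Path


def find_dead_jobs(tracked_jobs, live_pids):
    """Return the plot IDs of tracked jobs whose PID is no longer running.

    tracked_jobs maps PID -> plot ID (hypothetical manager bookkeeping).
    """
    return sorted(plot_id for pid, plot_id in tracked_jobs.items() if pid not in live_pids)


def orphaned_temp_files(temp_dir, dead_plot_ids):
    """List temp files whose name contains one of the dead plot IDs."""
    return sorted(
        path for path in Path(temp_dir).iterdir()
        if path.is_file() and any(plot_id in path.name for plot_id in dead_plot_ids)
    )


def clean_up(temp_dir, tracked_jobs, live_pids, dry_run=True):
    """Delete (or, with dry_run=True, only report) temp files of dead jobs."""
    orphans = orphaned_temp_files(temp_dir, find_dead_jobs(tracked_jobs, live_pids))
    if not dry_run:
        for path in orphans:
            path.unlink()
    return [path.name for path in orphans]
```

Defaulting to `dry_run=True` means a first call only lists what would be removed, which seems safer for a routine that deletes files.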

These are just some first thoughts I had. I have run into this at least 4-5 times in the past month and it is getting really annoying, but it seems solvable. The crashes are probably due to my using the computer as my main dev PC; I also game on it from time to time.

Like I said, I am not very familiar with Python programming, but if you point me in the right direction I am happy to try to implement something to address this issue.

EDIT:

For reference, my computer specs are:

CPU: Ryzen 3700X
RAM: 32GB (2x 16GB, 3200 CL16)
Plotting drives: 512 Intel SSD (2 jobs in parallel), 1TB Samsung 970 EVO SSD (4 jobs in parallel)
Destination drives: 8TB Seagate SkyHawk drives

When looking at system resource usage the plotting doesn't use all the system resources. I use 2 threads per job and 3418 ram.

sothix commented 3 years ago

You can use the following bash script to do a clean up in the meantime. It will search for the running chia processes related to your temp directory and will delete any files that don't have the same timestamp.

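Change both occurrences of /mnt/nvme1 to your temp directory. The grep pattern must also match the options you actually plot with.
To test the script first, remove -delete and add a file to your temp directory for it to find (e.g. touch /mnt/nvme1/test.txt); find will then only list the files it would delete.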

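# List the running plot jobs that use this temp dir, build one regex matching
# their temp files (plot-k32-<start time>), and delete every other file.
ps -eo lstart,pid,cmd | grep "chia plots create -k 32 -b 11000 -t /mnt/nvme1" | grep -v grep | awk '{
    b = "plot-k32-"
    # lstart spans fields $1..$5; convert it to the timestamp used in plot file names
    cmd = "date -d\"" $1 FS $2 FS $3 FS $4 FS $5 "\" +\047%Y-%m-%d-%H-%M\047"
    cmd | getline d
    close(cmd)
    # Join the prefixes with \| (alternation in the default find regex syntax);
    # the doubled backslash keeps the escape intact in gawk as well as mawk
    if (tmp) {
        tmp = sprintf("%s\\|%s%s", tmp, b, d)
    } else {
        tmp = sprintf("%s%s", b, d)
    }
} END {
    # Print the find command so the shell at the end of the pipeline runs it
    cmd = sprintf("find /mnt/nvme1 -type f ! -regex \".*\\(%s\\).*\" -delete", tmp)
    print cmd
}' | /bin/sh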
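For readers who find the awk hard to follow, the matching logic of the script can be sketched in Python. This is an illustration only, with hypothetical function names: it assumes, as the script does, that temp file names start with `plot-k32-` followed by the job's start time to the minute, and it only lists files rather than deleting them.

```python
from datetime import datetime
from pathlib import Path


def plot_prefix(start_time, k=32):
    """Build the file-name prefix the script derives from a process start time."""
    return start_time.strftime(f"plot-k{k}-%Y-%m-%d-%H-%M")


def files_without_running_job(temp_dir, start_times):
    """Return names of files in temp_dir that match no running job's prefix."""
    prefixes = [plot_prefix(t) for t in start_times]
    return sorted(
        path.name for path in Path(temp_dir).iterdir()
        if path.is_file() and not any(prefix in path.name for prefix in prefixes)
    )
```

With no running jobs this sketch reports every file, so the list should be checked before anything is deleted. Matching on the minute also assumes that the timestamp in the file name and the process start time fall in the same minute.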

Sartory commented 3 years ago

Hi, I've got the same problem on Linux (Ubuntu 20): the manager shows 8 jobs that have already finished and that I moved by hand. I'm not able to remove or reset them: [Screenshot 2021-08-02 at 15:36:04]

What I tried:

find . | grep -E "(__pycache__|\.pyc|\.pyo$)" | xargs rm -rf
git reset --hard HEAD
deleted corresponding logs in home/chia-logs

How can I remove broken/old jobs from the manager?

Thanks in advance!

swar / Swar-Chia-Plot-Manager

Clean up crashed plotting temp files after crash #1124