ekersey commented 2 years ago

Describe the bug

Plotman, on a plotter only worker, is not starting a new Madmax instance as expected when the 1st has started to copy to destination folder. Watching the plotting status on the /plotting/jobs page shows that when it gets to phase 4:1 it just stays there even after the copy to dst has started and it should be in phase 5:1. Incidentally, the "wall" time also stops updating. Also the "Plotting Speed" charts do not show the stats from the 2nd worker at all.

To Reproduce

I have a "fullnode" Machinaris container on the primary server, and a "plotter" worker on a 2nd machine. The plotman config files are mostly default aside from dst paths and number of threads for madmax. (the scheduling bits are the same).

Plotting jobs on the "fullnode" work as expected. When reaching stage 5:1 (copying to destination) a new plotter gets spun up while the copying proceeds. Plotting jobs on the "plotter" worker never update the status when reaching stage 5:1, and therefore never spin up a new plotting job until the copy to destination has actually finished.

Expected behavior

Expected that plotter on 2nd machine starts a new plot job when the 1st one starts copying final plot to destination (stage 5:1).

System setup:

OS: Unraid 6.10.0-rc2 (with docker-compose)
Docker version number 20.10.9
Machinaris branch: test
Machinaris version number 0.6.9

Config

Plotting scheduling parameters (same on both workers)

scheduling:
    # Run a job on a particular temp dir only if the number of existing jobs
    # before tmpdir_stagger_phase_major tmpdir_stagger_phase_minor
    # is less than tmpdir_stagger_phase_limit.
    # Phase major corresponds to the plot phase, phase minor corresponds to
    # the table or table pair in sequence, phase limit corresponds to
    # the number of plots allowed before [phase major, phase minor]
    # Five is the final move stage of madmax
    tmpdir_stagger_phase_major: 5
    tmpdir_stagger_phase_minor: 0
    # Optional: default is 1
    tmpdir_stagger_phase_limit: 1

    # Don't run more than this many jobs at a time on a single temp dir.
    # Increase for staggered plotting by chia, leave at 2 for madmax sequential plotting
    tmpdir_max_jobs: 2

    # Don't run more than this many jobs at a time in total.
    # Increase for staggered plotting by chia, leave at 2 for madmax sequential plotting
    global_max_jobs: 2 

    # Don't run any jobs (across all temp dirs) more often than this, in minutes.
    global_stagger_m: 30

    # How often the daemon wakes to consider starting a new plot job, in seconds.
    polling_time_s: 20
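
To illustrate how the stagger settings above gate new jobs, here is a minimal Python sketch. It is not Plotman's actual code: the `Sched` class and `can_start_job` function are hypothetical names, the rule is taken only from the config comments, and the time-based `global_stagger_m` check is left out. It shows why a job that is still reported at phase 4:1 blocks the next plot, while one reported at 5:1 does not.

```python
from dataclasses import dataclass
from typing import List, Tuple

Phase = Tuple[int, int]  # (major, minor), e.g. (4, 1) for phase 4:1


@dataclass
class Sched:
    # Hypothetical container mirroring the values in the config above.
    tmpdir_stagger_phase_major: int = 5
    tmpdir_stagger_phase_minor: int = 0
    tmpdir_stagger_phase_limit: int = 1
    tmpdir_max_jobs: int = 2
    global_max_jobs: int = 2


def can_start_job(sched: Sched, tmpdir_phases: List[Phase], total_jobs: int) -> bool:
    """Simplified check: may a new job start on this temp dir?"""
    if total_jobs >= sched.global_max_jobs:
        return False
    if len(tmpdir_phases) >= sched.tmpdir_max_jobs:
        return False
    milestone = (sched.tmpdir_stagger_phase_major, sched.tmpdir_stagger_phase_minor)
    # Count jobs that have not yet reached [phase major, phase minor].
    before = sum(1 for phase in tmpdir_phases if phase < milestone)
    return before < sched.tmpdir_stagger_phase_limit


if __name__ == "__main__":
    sched = Sched()
    # Status stuck at 4:1 during the copy: no new job is allowed.
    print(can_start_job(sched, [(4, 1)], total_jobs=1))
    # Status correctly advanced to 5:1: a new job may start.
    print(can_start_job(sched, [(5, 1)], total_jobs=1))
```

Under these assumptions, the symptom described above follows directly: as long as the reported phase never advances past 4:1, the count of jobs before 5:0 stays at the limit of 1.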

Additional context & screenshots

guydavis commented 2 years ago

Hi, thanks for the detailed report. Please provide plotman status output from the secondary plotter worker at the stage shown above. On the plotting system:

docker exec -it machinaris bash
plotman status

Please paste a screenshot of this status output from the plotman CLI.
Please also provide a full screenshot of your Workers page. Thanks.

ekersey commented 2 years ago

plotman status command appears to hang and does not return anything at all:

Workers page:

ekersey commented 2 years ago

plotman status finally returned when the plot finished copying to destination. Executing again right after shows new plot started.

guydavis commented 2 years ago

Hi, thanks for the detailed response. Very interesting that the separate plotman status process on your Ripper machine actually hung for the duration of the copy. As that is a separate plotman invocation, which simply spins through the container's process list looking for other running plotman and chia_plot processes, I am thinking there is some resource contention that is slowing/pausing the entire Docker container, not just the single Plotman job.

Secondly, you mentioned that the same plotting configuration on your Fullnode system does not exhibit this issue. Since the code is the same, that indicates something is different between the two systems at a hardware/volume level.

What differences exist in the dst volume path on each system?
Is the dst path on each system an Unassigned Device in Unraid?
Or perhaps a drive share with parity? If so, is caching turned on for that share?
Or is the dst path on the Ripper plotter a remote network share on the ChiaBoxZero fullnode?

I would recommend experimenting with different dst locations on the plotter. The original Plotman author's design, particularly for remote plotting, was for:

tmp: to be SSD or RAM disk.
dst: to be a locally attached staging drive
Then use Plotman archiving to transfer the completed plot from the staging location in dst over to a final location, either a remote server or a slow local drive.

Hope this helps, Guy

ekersey commented 2 years ago

Just found this, probably related: https://github.com/ericaltendorf/plotman/issues/714

Should mention that host OS for secondary plotter is Ubuntu 20.04, and destination drive is mounted NFS share from primary Unraid server.

Going to switch to local dst when current job is finished. Looks like maybe what I need to do is figure out how to get rsync setup for the archiver.

Edit: You posted as I was writing this. I'll report back this afternoon when first job with local dst finishes, but I suspect we'll be able to close this as not a Machinaris issue.

ekersey commented 2 years ago

Setting dst to a local path appears to have solved the problem. Sorry to have wasted your time.

It's a shame though, seems like a pointless hop to get to the final destination. And with the only drive I have available to use as a local dst currently, it's actually slower than copying over the network. So what was a 1 hour copy delay from the time a plot finished to when it was harvestable, is now 2.5 to 3 hours.

Maybe I'll try leaving the finished plot on the tmp folder and writing my own script to move it to the Unraid share.
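
A minimal sketch of what such a script could look like, assuming finished plots end in `.plot` and the Unraid share is already mounted on the plotter. The function name `move_finished_plots`, the `.partial` suffix, and the paths are all hypothetical; this is not part of Plotman or Machinaris. It copies under a temporary name and renames only once the copy is complete, so anything scanning the share for `*.plot` files should not pick up a half-copied plot.

```python
import shutil
from pathlib import Path
from typing import List


def move_finished_plots(staging: Path, final: Path) -> List[str]:
    """Move every *.plot file from staging to final; return the names moved."""
    moved = []
    for plot in sorted(staging.glob("*.plot")):
        partial = final / (plot.name + ".partial")
        # Slow step: copy under a temporary name on the destination.
        shutil.copy2(plot, partial)
        # Rename within the destination once the copy has finished.
        partial.rename(final / plot.name)
        # Free the space on the staging drive.
        plot.unlink()
        moved.append(plot.name)
    return moved


if __name__ == "__main__":
    # Example paths only; adjust to the real tmp dir and share mount point.
    print(move_finished_plots(Path("/plotting"), Path("/mnt/unraid_plots")))
```

Because this runs outside the plotter process, a slow or stalled network copy would hold up only the script rather than the plotting job itself.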

Anyhow, not a Machinaris issue. Thank you.

guydavis commented 2 years ago

Hi, no worries. One thing to try would be leaving the dst list empty in the plotman settings. My understanding is that Madmax would then consider the plot complete and leave it in the tmp location (probably an SSD). I believe you could then use Plotman archiving to transfer it from the tmp location to the remote system via rsync. I have not tested this scenario myself however, so please take this with a grain of salt.

Cheers!

guydavis / machinaris

Plotter status not updating correctly #558
