martymac / fpart

Sort files and pack them into partitions
https://www.fpart.org/
BSD 2-Clause "Simplified" License
230 stars 39 forks source link

fpsync status/progress reporting does not work correctly #48

Closed ispringer closed 1 year ago

ispringer commented 1 year ago

When estimating time remaining, it appears to only consider whether the jobs were submitted, not how much longer remaining jobs will take. Here's an example:

1687462151 <=== [QMGR] Done submitting jobs. Waiting for them to finish.
1687462190 <= [QMGR] Job 23067:248:local finished
1687462199 <= [QMGR] Job 27451:280:local finished

1687462200 <=== Parts done: 281/281 (100%), remaining: 0
1687462200 <=== Time elapsed: 1025s, remaining: ~0s (~3s/job)

1687462281 <=== Parts done: 281/281 (100%), remaining: 0
1687462281 <=== Time elapsed: 1106s, remaining: ~0s (~3s/job)

# 40 minutes pass, but status does not change at all aside from Time elapsed increasing
# during this time the final part is being processed. this part is 10x bigger than any other part due to containing a single huge file

1687464541 <= [QMGR] Job 5504:1:local finished
1687464541 <=== [QMGR] Queue processed
1687464541 <=== Parts done: 281/281 (100%), remaining: 0
1687464541 <=== Time elapsed: 3366s, remaining: ~0s (~11s/job)
1687464541 <=== Fpsync completed without error in 3366s.
1687464541 <=== End time: Thu Jun 22 20:09:01 UTC 2023
martymac commented 1 year ago

Hello Ian,

Thanks for your report.

Remaining time is calculated using observed (overall) average time per job and remaining jobs count.

There is indeed a problem in what you show here. Fpsync should be showing 1 remaining job and an associated time (!= '0s').

The problem is the last running processes (in your case, the last one) are accounted as 'done' because part of the work dir. I'll fix that ASAP, thanks!

As a side note, if partitions are not balanced (what you are observing here), remaining time will probably not be accurate, but at least it should not be '0s'.

Best regards,

Ganael.

martymac commented 1 year ago

Hello Ian,

ETA should be fixed with latest commit. Can you give it a try ?

Best regards,

Ganael.

ispringer commented 1 year ago

Looks good! As you said, the time estimates are still way off, but at least they are not zero.

1687797050 <=== Fpart crawling finished
1687797125 <=== Parts done: 56/281 (19%), remaining: 225 (4 running)
1687797125 <=== Time elapsed: 85s, remaining: ~225s (~1s/job)
...
1687797969 <=== Parts done: 277/281 (98%), remaining: 4 (4 running)
1687797969 <=== Time elapsed: 929s, remaining: ~12s (~3s/job)
...
1687798045 <=== Parts done: 278/281 (98%), remaining: 3 (3 running)
1687798045 <=== Time elapsed: 1005s, remaining: ~9s (~3s/job)
...
1687798068 <=== Parts done: 279/281 (99%), remaining: 2 (2 running)
1687798068 <=== Time elapsed: 1028s, remaining: ~6s (~3s/job)
...
1687798105 <=== Parts done: 280/281 (99%), remaining: 1 (1 running)
1687798105 <=== Time elapsed: 1065s, remaining: ~3s (~3s/job)

Fyi, this is the progress reporting function I wrote for our mongo-sync script. It uses the recursive size of the source versus target dirs to estimate the time remaining, so it's much more accurate.

function await_completion() {
  local target_node="$1"

  local start_time_secs=$(date +%s)
  local source_data_dir=$(source_data_dir)
  local target_data_dir=$(target_data_dir ${target_node})

  # Wait on all jobs to finish successfully
  while [[ $(jobs -r | wc -l) -gt 0 ]]; do
    local local_size_bytes=$(/usr/bin/du -s ${source_data_dir} 2>/dev/null | awk '{print $1}')
    local remote_size_bytes=$(ssh ${target_node} /usr/bin/du -s ${target_data_dir} 2>/dev/null | awk '{print $1}')
    if [[ ${remote_size_bytes} -eq 0 ]]; then
      local percent_complete=0
    else
      local percent_complete=$(bc <<<"scale=1; ${remote_size_bytes} * 100 / ${local_size_bytes}")
    fi
    if [[ ${percent_complete} == 0 ]]; then
      # We can't estimate time remaining if percent complete is 0
      echo "0% complete"
      continue
    fi
    local current_time_secs=$(date +%s)
    local elapsed_secs=$((current_time_secs - start_time_secs))
    local total_secs=$(bc <<<"scale=0; ${elapsed_secs} * 100 / ${percent_complete}")
    local remaining_secs=$((total_secs - elapsed_secs))
    if [[ ${remaining_secs} -lt 3600 ]]; then
      local time_remaining="$((remaining_secs / 60)) minutes"
    else
      local time_remaining="$(bc <<<"scale=1; ${remaining_secs} / 3600") hours"
    fi
    echo "${percent_complete}% complete, estimated time remaining ${time_remaining}"
    sleep 90
  done
}

Also, fpsync prints "Use ^T (SIGINFO) to display status" when debug logging is enabled. But on Linux, Ctrl-T is not mapped to SIGINFO. Actually, I don't think SIGINFO is even a thing on Linux. But I was able to make fpsync print a status report by using kill -29 <fpsync_pid>.

martymac commented 1 year ago

Thanks for your feedback!

Regarding ^T, it only works on FreeBSD. As you said, you'll have to send the signal manually under Linux :)

Thanks again for your report, I'll close that issue. Feel free to reopen it if necessary.

Cheers,

Ganael.