React better to partition migrations

mscastanho commented 2 years ago

Partition migrations on PowerVM may take a few minutes, and during that time libnxz will see consecutive paste failures, so we need to keep waiting until the migration is complete. So use different timeouts for waiting for a valid CSB and for a successful paste, giving PowerVM more time on the paste case to properly handle migrations.

There's currently a bug in the kernel side in which the CSB might not become valid for a long time during a migration unless we issue a new paste. To work around it nx_submit_job has been changed to avoid a nasty exit() call if nxu_run_job has returned with ETIMEDOUT. Instead, it now just returns -1, triggering a job resubmission.

mscastanho commented 2 years ago

I just realized this PR is not re-submitting the job in case of timeout waiting for CSB.V=1 as I described in the commit message. When that happens nxu_run_job and nx_submit_job just return the error up the stack. nx_deflate and nx_inflate_ ends up just returning Z_STREAM_ERROR to the user.

This is not ideal, because there's not much the user can do to recover from this error besides calling libnxz again. We could re-issue the job right from nxu_run_job, in the hopes that we'll get a proper response from NX eventually. In this case we wouldn't have to worry about the CC value returned by nx_submit_job anymore.

Something like:

diff --git a/lib/gzip_vas.c b/lib/gzip_vas.c
index 24f3e02..6271314 100644
--- a/lib/gzip_vas.c
+++ b/lib/gzip_vas.c
@@ -329,6 +329,9 @@ int nxu_run_job(nx_gzip_crb_cpb_t *cmdp, nx_devp_t nxhandle)
                                nx_fault_storage_address = 0;
                                continue;
                        }
+                       else if (ret == -ETIMEDOUT) {
+                               continue;
+                       }
                        else {
                                prt_err("wait_for_csb() returns %d\n", ret);
                                return ret;

A possible downside would be if there was a situation in which libnxz got stuck re-submitting jobs forever without ever getting a response under the timeout_wait_for_csb_v timeout. Are you aware of any such situations?

abalib commented 2 years ago

Are you aware of any such situations?

I am not aware. I don't know of any situation where csb.v will not be set unless there is a bug or hardware error.

Regarding "reissue" and assuming I understood your thinking: I would be concerned about reissuing the job, if csb.v timed out. What if the first job comes back after the second reissue? Two jobs might stomp on each others buffers. Does my question make sense here? (unless the first job may be cancelled somehow.)

Is there a problem waiting for csb.v in an infinite loop? Let's the sysadmin deal with stuck processes.

tuliom commented 2 years ago

Does my question make sense here?

@abalib I think it does and it's only showing it's important to get a proper csb.v before taking more decisions.

Is there a problem waiting for csb.v in an infinite loop?

That could cause starvation to other processes trying to use the NX GZIP while processes in an infinite loop lock the VAS credits. IMHO, it's fair to add a high timeout value (i.e. minutes) and avoid this scenario.

mscastanho commented 2 years ago

That could cause starvation to other processes trying to use the NX GZIP while processes in an infinite loop lock the VAS credits. IMHO, it's fair to add a high timeout value (i.e. minutes) and avoid this scenario.

@tuliom But if we don't add the infinite loop, and if we can't re-issue the job for the reasons @abalib mentioned (which I think makes sense by the way), the only thing left to do is return an error to the user, like this PR is currently doing.

This could put applications that were written with zlib in mind in a difficult situation, because they wouldn't know how to proceed since this would be an unexpected error.

tuliom commented 2 years ago

This could put applications that were written with zlib in mind in a difficult situation, because they wouldn't know how to proceed since this would be an unexpected error.

@mscastanho Yep. We try our best to avoid returning an error, but in this case we have no reasonable choice in my opinion. We should never have a situation where the CSB is invalid forever. This is a bug and bugs have to be fixed. If there is a very good reason to not fix this bug, I hope that we'll be able at least to distinguish between "CSB will never be valid" and "CSB will be valid soon".

That would allow libnxz to retry the same request safely.

Meanwhile, I think our only sane choice is to return an error to the user.

mscastanho commented 2 years ago

@tuliom Fair enough. I updated the PR:

Return -EAGAIN from nxu_run_job in case of timeout
Added the documentation about the value returned by nx_submit_job
Removed some dead code that would appear to return -EAGAIN to avoid confusion.

libnxz / power-gzip

React better to partition migrations #185