usdf_eq_write EQ full handling

rfaucett commented 9 years ago

I'd like to try to reserve one EQ element when making normal entry writes so that a call to fi_eq_write that would fill the queue instead inserts an EQ error entry saying "EQ overflow" (whatever error code is appropriate). Add testcase to eq_test for this also.

xuywang commented 9 years ago

Several thoughts about this and eq APIs in general

The man page of "fi_eq_readerr" says "EQs are optimized to report operations which have completed successfully. Operations which fail are reported 'out of band'. Such operations are retrieved using the fi_eq_readerr function. When an operation that completes with an unexpected error is inserted into an EQ, it is placed into a temporary error queue." usdf eq implementation actually uses the same queue for reporting success operation and error operation. Why? If an error is inserted in the end of the queue and there are other pending success events in front, fi_eq_readerr cannot retrieve the error event, especially with the case here (insert a EQ overflow error event if fi_eq_write would fill the queue).

What's the rationale to have an out-of-band of error event queue? I would think the the error event needs immediate attention. For av completion failure, it's not necessary to put it into the error event queue, as it might block the processes waiting for successful AV resolution completion events if they actually arrive before error event.

This is regarding to the general API. I don't understand the rationale that if an operation fails, two events have to be enqueued, one is the failure event, the other is the completion event. Why does application need to inspect the two events since the failure event is adequate to indicate the operation is complete? It will be extra burden for application to discard the completion event as it may already know of from the failure event. Also since the enqueing of both events is not guaranteed to be atomic, there might be other event inserted in-between. It would be difficult for application to write code handing those.
General API question too. Don't understand the need of application calling fi_eq_write. Is there an sample use case for this? Would it be allowed that application and provider can write the same eq? What if the eq has an pending error event when fi_eq_write is called, will FI_EAVAIL be returned same as fi_eq_read? The man page does not specify.

rfaucett commented 9 years ago

The man page is misleading. The error event should not "jump in front of" other successful completions. Rationale for "out of band" is that some HW supports EQs in hardware and reporting success is faster than getting failure details. It does not really matter for us, more of a nuisance than anything else.

For av completion failure, it's not necessary to put it into the error event queue, as it might block the processes waiting for successful AV resolution completion events if they actually arrive before error event.

Not sure what you are saying here - there is no "error event queue" - its all one. error events will not "block" success events, since a call to fi_eq_read() that returns -FI_EAVAIL is always immediately followed by "fi_eq_readerr()" to learn what the error is.

I don't understand the rationale that if an operation fails, two events have to be enqueued, one is the failure event, the other is the completion event.

Can you be more explicit? There is no such general requirement. av_insert in particular works this way because the user may insert a large vector of addresses, some of which fail and some of which succeed. Each failing resolution generates an error EQ entry, and after all is complete (either success or failure) there is a single event to indicate the insertion is done.

Would it be allowed that application and provider can write the same eq?

sure. it's a generic event queue. user may have a seperate thread that wants to send an event to the "event handling" thread, like "clean up and exit". No issue writing to an EQ with a pending error, so long as there is room.

xuywang commented 9 years ago

Not sure what you are saying here - there is no "error event queue" - its all one. error events will not "block" success events, since a call to fi_eq_read() that returns -FI_EAVAIL is always immediately followed by "fi_eq_readerr()" to learn what the error is.

I think fi_eq_readerr() current reads the first item in the queue. If there are pending success event in the queue in front of the error event, then it does not get the correct error event?

Same with the approach to insert an 'EQ overflow' error event if there is one item space in the queue, fi_eq_readerr() can only get the event at the head of the eq, not the error event at the tail.

av_insert in particular works this way because the user may insert a large vector of addresses, some of which fail and some of which succeed. Each failing resolution generates an error EQ entry, and after all is complete (either success or failure) there is a single event to indicate the insertion is done.

This clarifies.

sure. it's a generic event queue. user may have a seperate thread that wants to send an event to the "event handling" thread, like "clean up and exit". No issue writing to an EQ with a pending error, so long as there is room.

I can understand this use case, which can implemented by independent linux event, just feel this causes unnecessary complication. The provider need to consider concurrent access from user application and from its own separate thread. Not sure what gain we get here ...

rfaucett commented 9 years ago

I think fi_eq_readerr() current reads the first item in the queue. If there are pending success event in the queue in front of the error event, then it does not get the correct error event?

hm, probably a bug, tbh :) If the top event is non-error, then fi_eq_readerr() should return something to indicate that there is no error. -FI_EAGAIN ? Want to fix this, add a test case to eq_test.c, and update the man page?

fi_eq_readerr() can only get the event at the head of the eq,

yes, that is correct and by design. The "overflow" error event will not be seen until all the intervening events have been processes, and this is correct for maintaining order.

rfaucett commented 9 years ago

OK, we do the right thing is no error on top:

        /* make sure there is an error on top */
        if (usdf_eq_empty(eq) || !usdf_eq_error(eq)) {
                ret = -FI_EAGAIN;
                goto done;
        }

could still use a testcase for it though.

xuywang commented 9 years ago

Still does not make sense to me.

If there is an error event in the queue but not on top, fi_eq_readerr() returns FI_EAGAIN but cannot dequeue that error event. And the normal fi_eq_read() cannot deque the top success event either because FI_EAVAIL is expected to return when there is an error event in queue.

The "overflow" error event will not be seen until all the intervening events have been processes, and this is correct for maintaining order.

Then what's the purpose of inserting an overflow event? By the time application sees there is an overflow event in the eq, the eq is very likely not full anymore.

rfaucett commented 9 years ago

And the normal fi_eq_read() cannot deque the top success event either because FI_EAVAIL is expected to return when there is an error event in queue.

Error events do not "pass" regular events. -FI_EAVAIL is only returned when the top/next event is an error event. It does not mean "there is an error somewhere in the queue." I think this is the point that is tripping you up - all it means is "the next event is an error, so you need to pass me a different structure so i can give you all the data."

If there are 10 "success" events in the EQ, followed by an error, and you pass a vector of 20 entries to fi_eq_read, you will get back a good return of 10 completions, and the next call to fi_eq_read will return -FI_EAVAIL.

By the time application sees there is an overflow event in the eq, the eq is very likely not full anymore.

The error event does not mean "the EQ is full now" it means "someone tried to inject an event and could not because the EQ was full. you are either processing too slowly or your queue is too small"

Personally, I don't really like the separation of regular events from error events, but they are doing it for two reasons:

1) regular completions are smaller than error completions, so you can ask for an array of completions from fi_eq_read and your array element is smaller. When less common error events occur, then you pass in a pointer to one larger error completion structure to get the error data.

2) Sean claims certain hardware benefits from this separation. We have things we need to get into the API that benefit our HW and are mostly irrelevant for others, so this does not cause me lots of heartburn.

xuywang commented 9 years ago

I see. Error events do not precede success events. I probably got the impression from the man page.

rfaucett / libfabric

usdf_eq_write EQ full handling #10