triton-inference-server / client

Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala.
BSD 3-Clause "New" or "Revised" License

Only require throughput stability for PA HTTP async case #688

Closed: dyastremsky closed this pull request 1 month ago

dyastremsky commented 1 month ago

After fixing the async request rate case, perf analyzer no longer stabilizes most of the time for smaller models, because latency is more variable in the async case. As a result, only throughput (inferences per second) is now used for stabilization. PA stabilizes quite quickly with these updates.

Before changes: [DenseNet screenshot]

After changes: [DenseNet screenshot]
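
To make the stabilization criterion concrete, here is a minimal sketch of a throughput-only stability check. This is illustrative, not the actual perf_analyzer code; the window size and tolerance values are assumptions.

```python
# Minimal sketch of a throughput-only stability check (not the perf_analyzer
# implementation). The window size and tolerance below are assumed values.

def is_stable(throughputs, window=3, tolerance=0.10):
    """Return True if the last `window` throughput measurements (infer/sec)
    all fall within +/- tolerance of their mean."""
    if len(throughputs) < window:
        return False
    recent = throughputs[-window:]
    mean = sum(recent) / window
    return all(abs(t - mean) <= tolerance * mean for t in recent)

# Throughput settles across measurement intervals, so the run can stop even
# if latency is still noisy in the async case.
print(is_stable([410.0, 492.0, 503.0, 498.0, 505.0]))  # True
print(is_stable([410.0, 492.0, 390.0]))                # False
```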

tgerdesnv commented 1 month ago

Are there also potential documentation changes needed?

tgerdesnv commented 1 month ago

I'm also concerned that we are still reporting latency numbers. We may need a way to indicate that the latency is not stable.

dyastremsky commented 1 month ago

> Code looks correct. Would be nice to have some sort of testing to confirm and protect.

Correct in what way? You can see the screenshots above. I'll also be adding testing to ensure PA stabilizes, as a separate PR into the main feature branch.

dyastremsky commented 1 month ago

> I'm also concerned that we are still reporting latency numbers. We may need a way to indicate that the latency is not stable.

Do you think this is something users care about? Will defer to you and @nnshah1 here. It seemed like even our team wasn't super familiar with what stabilization means, so I'm not sure that it's important enough to call out that latency may not have stabilized.

As a counterargument to the point about users not caring, we know that uncapping the request rate in this way can lead to unbounded growth so average latency won't be the most accurate measure. But we do show the latency per run, so users can see that.
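
To make the unbounded-growth point concrete, here is a toy calculation (the rates below are assumptions, not numbers from this PR): when the client issues requests faster than the server completes them, the backlog, and therefore the wait for each newly issued request, grows without bound.

```python
# Toy illustration with assumed numbers: if the uncapped request rate exceeds
# the model's throughput, queued work accumulates and latency keeps growing.
request_rate = 200.0  # requests/sec issued by the client
throughput = 150.0    # requests/sec the server can actually complete

for elapsed in (1, 10, 60):  # seconds into the run
    backlog = (request_rate - throughput) * elapsed  # requests still queued
    wait = backlog / throughput                      # seconds a new request waits
    print(f"after {elapsed:>2}s: backlog={backlog:.0f} requests, "
          f"new request waits ~{wait:.1f}s")
```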

dyastremsky commented 1 month ago

> Are there also potential documentation changes needed?

Thanks for this callout. I'll add these changes.

nnshah1 commented 1 month ago

> I'm also concerned that we are still reporting latency numbers. We may need a way to indicate that the latency is not stable.

> Do you think this is something users care about? Will defer to you and @nnshah1 here. It seemed like even our team wasn't super familiar with what stabilization means, so I'm not sure that it's important enough to call out that latency may not have stabilized.

> As a counterargument to the point about users not caring, we know that uncapping the request rate in this way can lead to unbounded growth so average latency won't be the most accurate measure. But we do show the latency per run, so users can see that.

My preference would be to handle this in documentation and not tool output.

If I understand correctly, we are saying that for cases where request rate >> throughput, latency is effectively unbounded. I think that's reasonable to document. We probably also want to add (if not already present) the ability to sweep, i.e. start at a low request rate and increase it until throughput no longer increases, then stabilize latency at that request rate.
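
A rough sketch of that sweep idea, assuming a `run_measurement(rate)` helper that returns (throughput, latency) for one measurement at a fixed request rate; the helper, step sizes, and gain threshold are hypothetical, not an existing PA API.

```python
# Hypothetical sketch of the sweep described above: raise the request rate
# until throughput stops improving, then return the last rate at which it was
# still growing. `run_measurement` is an assumed helper, not a real PA API.

def sweep_request_rate(run_measurement, start=10.0, step=10.0,
                       max_rate=1000.0, min_gain=0.02):
    best_throughput = 0.0
    rate = start
    while rate <= max_rate:
        throughput, _latency = run_measurement(rate)
        if throughput < best_throughput * (1.0 + min_gain):
            # Throughput no longer increases meaningfully: the previous rate
            # sits at the knee of the curve, where latency is still bounded
            # and can be stabilized.
            return rate - step
        best_throughput = throughput
        rate += step
    return max_rate
```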

tgerdesnv commented 1 month ago

> Code looks correct. Would be nice to have some sort of testing to confirm and protect.

> Correct in what way? You can see the screenshots above. I'll also be adding testing to ensure PA stabilizes, as a separate PR into the main feature branch.

I'm just commenting that the code changes look fine/correct to me.

dyastremsky commented 1 month ago

> I'm also concerned that we are still reporting latency numbers. We may need a way to indicate that the latency is not stable.

> Do you think this is something users care about? Will defer to you and @nnshah1 here. It seemed like even our team wasn't super familiar with what stabilization means, so I'm not sure that it's important enough to call out that latency may not have stabilized. As a counterargument to the point about users not caring, we know that uncapping the request rate in this way can lead to unbounded growth so average latency won't be the most accurate measure. But we do show the latency per run, so users can see that.

> My preference would be to handle this in documentation and not tool output.

> If I understand correctly, we are saying that for cases where request rate >> throughput, latency is effectively unbounded. I think that's reasonable to document. We probably also want to add (if not already present) the ability to sweep, i.e. start at a low request rate and increase it until throughput no longer increases, then stabilize latency at that request rate.

I'll add a message.

> Code looks correct. Would be nice to have some sort of testing to confirm and protect.

> Correct in what way? You can see the screenshots above. I'll also be adding testing to ensure PA stabilizes, as a separate PR into the main feature branch.

> I'm just commenting that the code changes look fine/correct to me.

Ah, gotcha. Apologies, I misread that earlier. Thanks! 😄

dyastremsky commented 1 month ago

> I'm also concerned that we are still reporting latency numbers. We may need a way to indicate that the latency is not stable.

> Do you think this is something users care about? Will defer to you and @nnshah1 here. It seemed like even our team wasn't super familiar with what stabilization means, so I'm not sure that it's important enough to call out that latency may not have stabilized. As a counterargument to the point about users not caring, we know that uncapping the request rate in this way can lead to unbounded growth so average latency won't be the most accurate measure. But we do show the latency per run, so users can see that.

> My preference would be to handle this in documentation and not tool output.

> If I understand correctly, we are saying that for cases where request rate >> throughput, latency is effectively unbounded. I think that's reasonable to document. We probably also want to add (if not already present) the ability to sweep, i.e. start at a low request rate and increase it until throughput no longer increases, then stabilize latency at that request rate.

Created a ticket to look into sweeping the request rate (TPA-140).

I added a warning message when the latency is not stabilizing, as discussed (see PR description). I also added a unit test. I can look at adding more extensive testing and fixing the 0 sec avg HTTP time bug as separate PRs into the feature branch.
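
For illustration only (the check, tolerance, and message wording below are assumptions, not necessarily what the PR implements), the warning boils down to applying the same windowed tolerance test to latency and flagging the run when only throughput settled:

```python
import logging

# Illustrative only: flag runs where throughput settled but latency did not,
# so users know the reported latency may not be representative. The window,
# tolerance, and message wording are assumptions, not the PR's actual output.

def warn_if_latency_unstable(latencies, throughput_stable,
                             window=3, tolerance=0.10):
    if len(latencies) < window:
        return False
    recent = latencies[-window:]
    mean = sum(recent) / window
    latency_stable = all(abs(l - mean) <= tolerance * mean for l in recent)
    if throughput_stable and not latency_stable:
        logging.warning(
            "Throughput stabilized but latency did not; reported latency for "
            "this async request-rate run may not be representative."
        )
    return latency_stable
```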

This pull request is ready for review.