openapi-ts / openapi-typescript

Generate TypeScript types from OpenAPI 3 specs
https://openapi-ts.dev
MIT License

Better Benchmarking for openapi-fetch? #1818

Open gzm0 opened 3 months ago

gzm0 commented 3 months ago

@drwpow, I'm taking the liberty of moving your comment to an issue so we don't lose the discussion.

Also a note: I’ve found the benchmarking to be wildly inconsistent 😓. Benchmark support is still in beta in Vitest, and I think we should consider an alternative.

Or, put another way, the absolute numbers fluctuate wildly. But the relative measurements have some consistency (i.e. openapi-fetch being XX% faster than another package stays consistent-ish).

It’s been on my list to come up with more reliable benchmarks—would love your thoughts if there’s a better way to profile performance.

Also, like I said before, anecdotally I haven’t had good experiences with performance regression tests (or, at least, not without an entire team at a company dedicated to maintaining them, and we don’t have that for this OSS project). CI (especially free CI) just fluctuates too widely with resources to be reliable. But I’m strongly in favor of manually-run benchmarking that can be run whenever we want to see if something moves the needle up or down. And I think there are improvements to be made in this area, and I welcome your input! 🙏

Originally posted by @drwpow in https://github.com/openapi-ts/openapi-typescript/issues/1810#issuecomment-2265690346
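For context, below is a minimal sketch of the kind of comparative Vitest benchmark being discussed. It is not the repository's actual benchmark file; the `paths` import, the endpoint, and the base URL are placeholders.

```ts
// bench.ts: minimal sketch of a comparative Vitest benchmark (placeholder spec/URL).
import { bench, describe } from "vitest";
import createClient from "openapi-fetch";
import type { paths } from "./api"; // hypothetical types generated by openapi-typescript

const client = createClient<paths>({ baseUrl: "https://api.example.com" });

describe("GET /users/{id}", () => {
  bench("openapi-fetch", async () => {
    await client.GET("/users/{id}", { params: { path: { id: 123 } } });
  });

  bench("plain fetch (baseline)", async () => {
    const res = await fetch("https://api.example.com/users/123");
    await res.json();
  });
});
```

Such a file is run with `vitest bench`; as noted above, the absolute numbers fluctuate, so only the relative comparison between cases is meaningful.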

gzm0 commented 3 months ago

First of all, my only real-life experience with benchmarking is on Scala.js (Scala to JavaScript transpiler, I'm one of the maintainers).

So far, we have performed three types of benchmarks (I'll compare them to what we probably want here below):

We execute these benchmarks manually and on demand (as in: when we have a hunch a change might affect performance, not even on release). We have no automated infrastructure to run them whatsoever.

The first benchmark bullet is probably the one most similar to what we'd want here; it is, of course, also the one I haven't been working on at all :shrug:.

I think one core difference between the Scala.js and the (current) openapi-fetch benchmarks is that the Scala.js benchmarks are CPU bound (they do little to no I/O). The openapi-fetch benchmarks, as they are now, are likely heavily I/O bound: I do not know how msw works under the hood and what parts of the network stack it abstracts, but it would not surprise me if it still needs to invoke quite a bit of it. As a result, it might very well be that the openapi-fetch benchmarks currently mostly benchmark msw. This could somewhat explain the high fluctuations.

I have a feeling that for better low-level performance analysis, we might need to split the benchmarks:

I doubt it is feasible to build the second kind of benchmark for other frameworks. Maybe for the ones that also wrap fetch. But even then, we'd likely need to make assumptions about the specific usage of fetch, etc. (e.g. openapi-fetch parses the JSON response as part of the invocation; if another framework doesn't do that, the comparison gets hard).

HTH. Happy to discuss further.

drwpow commented 3 months ago

Love all these thoughts. I don’t disagree with any of it!

the openapi-fetch benchmarks currently mostly benchmark msw. This could somewhat explain the high fluctuations.

That is a very good observation that I hadn’t considered. You’re right; regardless, msw shouldn’t be part of the benchmarks. fetch should be stubbed in a way that is basically an async nextTick for the purposes of benchmarking.
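A minimal sketch of what such a stub could look like (an assumption, not the implementation that landed in the repo; the response body is an arbitrary placeholder):

```ts
// Stub fetch so each "request" costs one microtask plus Response construction,
// keeping the benchmark CPU-bound on client code instead of on a network stack.
const BODY = JSON.stringify({ id: 123, name: "example" });

async function stubFetch(_input: RequestInfo | URL, _init?: RequestInit): Promise<Response> {
  await Promise.resolve(); // roughly an async "nextTick"
  return new Response(BODY, {
    status: 200,
    headers: { "Content-Type": "application/json" },
  });
}
```

A stub like this could be assigned to `globalThis.fetch`, or, for openapi-fetch specifically, passed as the client's custom `fetch` option.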

we might need to split the benchmarks:

Agree completely. Also agree on not bothering with monitoring low-level benchmarks on other libraries, but still having some (manually-run) way to check if any PR significantly impacts JIT performance.

I like benchmarking other libraries for two purposes:

  1. The obvious, it’s valuable to have fast, lightweight code. Fast code means a faster user experience, and lightweight code makes the internet more accessible to people that need it (low-speed areas and less-developed countries). Devs don’t have time to evaluate libraries at this level, and this helps them choose.
  2. Less obvious: benchmarking other libraries helps us identify areas of improvement in our code. We don’t have time to study the internals of all the other fetch libraries out there. But if we benchmark them, certain libraries stand out as worth studying, to see how their speed was achieved. And since this is library code, not application code, sweating the details matters and pays dividends across everyone using this.

gzm0 commented 3 months ago

benchmarking other libraries helps us identify areas of improvement in our code

That's a very good point. I never thought about it that way. But indeed, that can give you an "achievable baseline".

drwpow commented 2 months ago

Just removed MSW from the benchmarks, and it indeed seemed to have been contributing heavily to the fluctuations. I’m sure there will still be a little flakiness, as is normal, but it seems better than it was.

But without MSW we’ll have to mock the internals of some of the external libraries more carefully, because it was doing a good job of automatically shimming fetch. So we’ll need to mock axios, superagent, and openapi-typescript-codegen’s internal fetchers without accidentally mocking internal runtime code and giving them a fake “boost”. Axios especially: you can give it a custom fetcher, but I want to avoid doing that because that’s an entirely different internal codepath, one that isn’t common. I’d like to mock in a way that all libraries are called with their defaults, if at all possible.
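For the fetch-based libraries, one possible way to do that is Vitest's global stubbing, sketched below using the hypothetical `stubFetch` helper from the earlier comment. Libraries whose Node defaults go through the `http` module (axios, superagent) would still need their own interception, which this sketch does not cover.

```ts
import { bench, vi } from "vitest";
import createClient from "openapi-fetch";
import type { paths } from "./api";       // hypothetical generated types
import { stubFetch } from "./stub-fetch"; // hypothetical module exporting the stub above

// Replace global fetch for this benchmark file so that libraries calling
// globalThis.fetch by default pick up the stub without any custom adapter.
vi.stubGlobal("fetch", stubFetch);

const client = createClient<paths>({ baseUrl: "https://api.example.com" });

bench("openapi-fetch (stubbed global fetch)", async () => {
  await client.GET("/users/{id}", { params: { path: { id: 123 } } });
});

// vi.unstubAllGlobals() would restore the real fetch afterwards.
```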

gzm0 commented 2 months ago

yes, that's the downside unfortunately :(