tinylibs / tinybench

Excessive data retention on fast benchmarks #61

Open jdmarshall opened 10 months ago

jdmarshall commented 10 months ago

I don't think the warmup time for tinybench is anywhere near adequate for getting V8 to settle the code. 'Solving' that problem by deoptimizing tinybench is fighting the wrong fight.

In an attempt to get more accurate data out of tinybench, I've been using time: 2000, or in some cases 6000, for some test suites.

The problem here is that, given that some leaf-node tests run 2, 3, even 4 orders of magnitude faster than the more complex tests, some of my tests generate thousands of samples while others generate several million. This extra memory pressure is causing some weird race conditions with the code under test.

Now, while switching the telemetry calculation to an algorithm that uses O(k) or O(log n) memory for running totals is probably a lot to ask, retaining samples for task A while running task C is a waste of resources.
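
(Illustration only, not tinybench code: a minimal sketch of the kind of constant-memory running totals meant here. Welford's online algorithm keeps count, mean, and variance in O(1) space; quantiles such as p75/p99 would still need a streaming sketch like the t-digest discussed later in this thread.)

```ts
// Constant-memory running statistics (Welford's online algorithm): samples
// are folded into the totals as they arrive and never retained.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0;

  push(sampleMs: number): void {
    this.n += 1;
    const delta = sampleMs - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (sampleMs - this.mean);
  }

  get count(): number { return this.n; }
  get average(): number { return this.mean; }
  get variance(): number { return this.n > 1 ? this.m2 / (this.n - 1) : 0; }
  get stdDev(): number { return Math.sqrt(this.variance); }
}
```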

Expected behavior: Disabling all but one task should result in the same success or failure modes for the code under test. The summaries should be calculated at the end of a task and the working data discarded.

Actual behavior: all statistics are kept for the entire run.

Workaround: use the iterations option instead of time.
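
(A minimal sketch of the two configurations being contrasted. The time and iterations option names are from the tinybench docs; the values are purely illustrative, and as I read the options both act as lower bounds on how long a task runs.)

```ts
import { Bench } from 'tinybench';

// Time-based sampling: accurate for slow tasks, but a fast task can pile up
// millions of samples over a 2000 ms window.
const timeBased = new Bench({ time: 2000 });

// Workaround described above: bound the sample count with `iterations` and
// keep `time` tiny so the iteration count dominates.
const iterationBased = new Bench({ time: 1, iterations: 10_000 });

iterationBased.add('fast leaf op', () => {
  // ...code under test...
});

await iterationBased.run(); // top-level await assumes an ES module context
console.table(iterationBased.table());
```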

jdmarshall commented 10 months ago

Because the summary is calculated after teardown, I don't even have the option to force this behavior in my own tests.

Aslemammad commented 10 months ago

I see, thank you so much for submitting an issue. Doesn't the iterations option help?

jdmarshall commented 10 months ago

That's what I'm using as the workaround. I'm definitely still having some sort of issue with ensuring V8 is completely warm, but then that's a very old benchmarking problem.

jdmarshall commented 10 months ago

Do you suppose it would break anyone if you calculated the performance numbers prior to teardown?

I have no way of knowing if anyone but you is depending on the existence of the telemetry data at the end of a run. I could expunge the samples during teardown. It's not awesome, but as a documented workaround I've encountered worse.

But right now teardown seems to run first, which scuttles that idea.

My most problematic library has a couple dozen benchmarks, so clearing the data between tasks would make more than an order of magnitude difference in memory utilization.

Aslemammad commented 10 months ago

> Do you suppose it would break anyone if you calculated the performance numbers prior to teardown?

I don't think so, we can try and see.

Aslemammad commented 10 months ago

> expunge the samples during teardown

The samples are the problem here?

jdmarshall commented 10 months ago

That's where all the data goes, yes. But teardown is the last event I can see between tasks, and it is called immediately before the P## calculations are done. I could cheese something with the setup event, but that involves figuring out what task we are on and looking up the previous task, which seems problematic.

For me, the difference in order between the calculations and the event doesn't matter, except for whether I can manually prune data prior to the table() call. But I don't know enough people using tinybench in enough ways to offer any sort of suggestion on whether changing the order would break their use cases.

I hate adding flags to fix user-specific problems, so a programmatic solution seems slightly more palatable IMO. But bumping the minor version number might be called for if you decide on that option.

jdmarshall commented 10 months ago

I've gone down a rabbit hole. Send help.

https://github.com/tdunning/t-digest

Unfortunately the whole tinybench library is about the size of this algorithm (although it's in Java so maybe divide that by 2?).

So while this is academically cool I expect this is A Bridge Too Far territory.

Aslemammad commented 10 months ago

Haha, I see, it's getting a bit complicated. But when you gather all your findings, summarize them so we can work on them soon.

jdmarshall commented 10 months ago

This one doesn't seem too bad. The actual logic part of it is about 2 computer screens, and it's based on a refinement of t-digest. It's also the only one that seems to be moderately actively maintained.

https://github.com/hville/sample-distribution

The algorithm does have fairly enticing amortized overheads for space and computation, less than the sort() call alone.

But really, just inverting the calc and event-emit phases is probably going to solve most people's problems. If your test can't manage 50 MB of telemetry per app, you've got bigger fish to fry. If you tried to scale this up to some sort of replacement for Gatling or BlazeMeter, then you would really want to look at this.

polak-jan commented 10 months ago

I have had vitest crash while running benchmarks, and I am pretty sure this was the reason why. The crash was caused by heap memory filling up. On subsequent runs with lower benchmark times I am easily getting over 5 GB used by the process, which seems to be mostly the samples.

Had to add a loop inside the test cases to artificially slow them down to get the run-time I wanted without crashing.

jdmarshall commented 10 months ago

Workaround: It looks like you could clear the extra statistics in the 'cycle' event (though to me 'teardown' sounds like the spot where one would expect to manipulate system state).
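
(A rough sketch of that cycle-event workaround. It assumes the raw samples are exposed as task.result.samples, as in tinybench 2.x; property and event names may differ between versions.)

```ts
import { Bench } from 'tinybench';

const bench = new Bench({ time: 2000 });

bench.add('fast leaf op', () => {
  // ...code under test...
});

// By the time a task's 'cycle' event fires, its summary statistics (mean,
// p75, p99, ...) have already been computed, so the raw samples can be
// dropped to release memory before the next task runs. The Samples column
// of bench.table() may then read 0, since it is derived from this array.
bench.addEventListener('cycle', (e) => {
  const samples = e.task?.result?.samples;
  if (samples) samples.length = 0;
});

await bench.run();
console.table(bench.table());
```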

psirenny commented 7 months ago

> Had to add a loop inside the test cases to artificially slow them down to get the run-time I wanted without crashing.

@polak-jan could you explain your workaround a bit more?

polak-jan commented 7 months ago

@psirenny I just added a loop around the entire benchmark, so each run of the benchmark is actually X runs. This isn't really ideal, as it can interfere with things like cache and JIT state. But I don't think there is any other reasonable way to get around the limitation right now. And assuming all your benchmarks use the same workaround, they should all be offset by a similar amount, so comparisons should still be mostly accurate.
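
(For completeness, a minimal sketch of that shape of workaround; the inner repeat count is arbitrary and fastLeafOp is a hypothetical stand-in for the code under test.)

```ts
import { Bench } from 'tinybench';

// Hypothetical stand-in for the real code under test.
function fastLeafOp(): void {
  Math.sqrt(Math.random());
}

const bench = new Bench({ time: 2000 });

// Each recorded sample now covers 1000 inner runs, so the number of retained
// samples drops by roughly three orders of magnitude. The trade-offs noted
// above still apply: the loop can interact with caches and JIT state, and the
// reported time is per batch of 1000 calls rather than per call.
bench.add('fastLeafOp x1000', () => {
  for (let i = 0; i < 1000; i++) fastLeafOp();
});

await bench.run();
console.table(bench.table());
```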