petkaantonov / bluebird

:bird: :zap: Bluebird is a full featured promise library with unmatched performance.
http://bluebirdjs.com

Benchmark: unfair advantage to some implementations #610

Closed - avih closed this issue 9 years ago

avih commented 9 years ago

The benchmark results depend immensely on the performance of the denodeify implementation - results can change by an order of magnitude depending on the performance of denodeify (this also holds true for the .all implementations, though to a lesser extent).

However, not all of the Promise implementations being tested in the benchmark implement this functionality internally, and for those, the Bluebird benchmark includes a "fake" implementation in benchmark/lib/fakesP.js.

This file also includes a fallback implementation as:

else {
    var lifter = require('when/node').lift;
}

which is used for at least one of the benchmarked implementations: promises-dfilatov-vow.js (beyond the obvious promises-cujojs-when.js).
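
For context: "lifting" here (i.e. denodeify/promisify) means converting a node-style callback function into a function that returns a promise. A minimal generic sketch, assuming only a standard Promise constructor (this is not the benchmark's actual wrapper), might look like:

// Generic denodeify sketch (illustrative only): wraps fn(args..., cb(err, result))
// into a function that returns a promise.
function denodeify(fn) {
    return function () {
        var args = Array.prototype.slice.call(arguments);
        var self = this;
        return new Promise(function (resolve, reject) {
            args.push(function (err, result) {
                if (err) reject(err);
                else resolve(result);
            });
            fn.apply(self, args);
        });
    };
}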

I believe that the benchmark should only test what the library provides, and if the library's API is slightly different from what the code needs, then a wrapper should be used in a way which does not put that library at a disadvantage. However, I believe that the wrapper in fakesP.js is very far from providing equal grounds.

The following examples use parallel, but a similar pattern also appears with doxbee.

Example: normal run of the benchmark on my system using ./bench parallel:

results for 10000 parallel executions, 1 ms per I/O op

file                                time(ms)  memory(MB)
promises-bluebird.js                     359      109.33
promises-bluebird-generator.js           422      113.92
promises-cujojs-when.js                  594      164.51
promises-tildeio-rsvp.js                 625      217.22
callbacks-caolan-async-parallel.js       922      225.85
promises-lvivski-davy.js                 922      280.55
promises-calvinmetcalf-lie.js           1062      380.36
callbacks-baseline.js                   1093       37.64
promises-dfilatov-vow.js                2187      534.49
promises-ecmascript6-native.js          2391      542.76
promises-then-promise.js                2453      695.89
promises-medikoo-deferred.js            3078      535.80
promises-obvious-kew.js                 4594      963.20

Platform info:
Windows_NT 6.3.9600 x64
Node.JS 1.8.1  <-- that's actually io.js v1.8.1
V8 4.1.0.27
Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz × 4

However, if we modify fakesP.js to always use one implementation for all the libraries, with the following patch, which changes it from a fallback into the implementation used for every library:

-else {
     var lifter = require('when/node').lift;
-}

Then the results look like this (obviously same system and same invocation as above):

results for 10000 parallel executions, 1 ms per I/O op

file                                time(ms)  memory(MB)
promises-cujojs-when.js                  563      163.62
promises-lvivski-davy.js                 625      198.71
promises-then-promise.js                 672      218.51
promises-bluebird.js                     891      251.84
promises-bluebird-generator.js           937      255.72
callbacks-caolan-async-parallel.js       969      225.84
promises-tildeio-rsvp.js                1078      360.25
callbacks-baseline.js                   1110       37.63
promises-obvious-kew.js                 1203      319.76
promises-calvinmetcalf-lie.js           1750      431.94
promises-dfilatov-vow.js                2157      535.04
promises-medikoo-deferred.js            3359      554.31
promises-ecmascript6-native.js          6188      937.73

and suddenly the numbers look very different. Some implementations move up the ladder, while others move down (well, except for promises-dfilatov-vow.js and promises-cujojs-when.js, which were already using this implementation before the patch, though they may still appear higher or lower now because others have moved).

For instance, relative to each other, some of the results changed by a factor of ~10.

The results with the patch can actually be considered much more "fair", since the benchmark then uses "external independent" code (external to all three of the libs listed above) as the lifter, thus putting them on equal grounds.

(I do realize that of those three, only kew has a wrapper implementation in fakesP.js, while the other two do seem to provide internal implementations, but the example is meant to show how the results can change when using the same "construction" to test different implementations.)

So, what does this mean? It means that code outside the benchmarked library greatly affects the numbers which the benchmark reports as the score of that library, and this effect (and its magnitude) is different for different libraries.

How can it be improved?

IMO the best thing to do is to use the absolutely minimal API possible from the libraries (probably what the specifications define and what the promises-aplus-tests suite expects - the plain Promise API), and then build the whole benchmark construction on top of that, as sketched below. This way, libraries which don't implement denodeify or all will still get exactly the same treatment as those that do - benchmarking of their Promise implementation.
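
As a rough illustration, an all helper built only on the plain Promise API (constructor plus .then) could look like the hypothetical sketch below; genericAll is an invented name and not part of any of these libraries:

// Every library goes through exactly the same code, using only the
// Promise constructor and .then of the library under test.
function genericAll(PromiseCtor, promises) {
    return new PromiseCtor(function (resolve, reject) {
        var results = new Array(promises.length);
        var remaining = promises.length;
        if (remaining === 0) return resolve(results);
        promises.forEach(function (p, i) {
            p.then(function (value) {
                results[i] = value;
                if (--remaining === 0) resolve(results);
            }, reject);
        });
    });
}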

Obviously, not everyone will like this suggestion, since clearly some/many library authors have put a lot of time and effort into improving and fine-tuning these features, and with such a suggestion that effort might not manifest at all in the benchmark.

OTOH, authors of libraries which have not implemented those (and possibly other features - I didn't examine the benchmark code much) would finally have a benchmark which actually tests only what their library provides, and doesn't make their library "look bad" due to code which isn't part of the library.

The solution to this is, IMO, a different benchmark for features which are not considered core features of Promises.

One (or several) benchmarks would build only on top of the common Promise features and test how well that works in itself and compared to other libraries, and another benchmark could test the performance of the "extra" features. If a library provides such a feature, it would make sense to test it.

But what does not make much sense IMO is that the benchmark itself includes wrappers which put some implementations at a great disadvantage as far as the benchmark results go.

petkaantonov commented 9 years ago

If you have a faster implementation of the wrappers for libraries that don't provide their own, a PR is welcome.

However, the point of the benchmark is realism; a library without a denodeify implementation is virtually unusable, and anyone using such a library for some reason would realistically implement their own or perhaps install some addon from npm.

petkaantonov commented 9 years ago

Also, that else case is not a fallback; the code only reaches it when benchmarking when.js... It makes absolutely no sense to use that function when benchmarking any other library.

benjamingr commented 9 years ago

What petka said basically.

Fast promisification is a core feature for a library and most libraries provide it. The goal of the benchmarks is to test the promise library's performance - not just .then. Converting callback APIs is a core problem promise libraries have to deal with, and using an implementation that can't make library-specific optimizations isn't fair to the actual way people use the libraries.

avih commented 9 years ago

However, the point of the benchmark is realism; a library without a denodeify implementation is virtually unusable

That is SO not true. There are many real-world use cases of Promises outside of node and without the need to promisify/denodeify, and even more cases where denodeify is not part of any hot loop - the Promise implementation itself is the thing which runs over and over, while the performance of denodeify does not matter.

However, this benchmark quietly decides to make denodeify the biggest representative of "practicality" in Promises, then slaps on some implementation which isn't part of the library, and calls it a day.

While I don't disagree that denodeify is indeed a very useful feature, you can't just pretend to compare the speed of Promise implementations while in fact a very big part of the score is determined by whatever denodeify implementation the code happens to end up with.

If you want to compare the performance of denodeify, make a benchmark which tests denodeify on libraries which provide such an implementation, and call it denodeify.

But if you pretend to compare different implementations of promises on equal grounds, you should benchmark only what the libraries provide, and where a wrapper is needed, you should be 110% sure that the wrapper doesn't bias the results.

There are many places on the web pointing to this benchmark, and quite a few of them wonder how come Q is so slow; what they don't know is that it only appears slow due to the nature of the benchmark, which is mostly affected by the denodeify implementation which the benchmark implements for Q.

If you take an independent function as a lifter for the benchmark, then suddenly everything looks very different, and especially Q (and no, I have no relation to Q). Of course, with an external lifter it now measures a different thing, but now that different thing is the same for all the tested libraries - which is what a benchmark should do.

The benchmark results are overall highly misleading, and more importantly they hide the fact that a major part of the score comes from the "extras" functions which some libraries provide, and from the wrappers (which the benchmark itself provides) for libraries that don't have them, while the performance of the actual Promise implementation is only a small part of the score.

It's just misleading.

petkaantonov commented 9 years ago

denodeify implementation which the benchmark implements for Q.

Why would you make so obviously blatantly false statements? Seriously?

avih commented 9 years ago

Because I'm human and I mistook kew for Q. Sorry for that.

Regardless, in my first post I gave an example of three libraries which change their relative scores greatly once a common external lifter is used, and therefore my claim remains valid.

petkaantonov commented 9 years ago

You cannot use a common lifter like you did, as I explained in my comment:

Also, that else case is not a fallback; the code only reaches it when benchmarking when.js... It makes absolutely no sense to use that function when benchmarking any other library.

avih commented 9 years ago

And I myself said earlier that obviously it now tests a different thing (possibly not a very good thing), but that different thing is now identical for all the tested libraries, and a benchmark should test the same thing across different libraries.

petkaantonov commented 9 years ago

Yes, it's not a very good thing - you are basically benchmarking when.js 4 times under 4 different labels :)

benjamingr commented 9 years ago

Thanks for the discussion @avih

That is SO not true. There are many real world use cases of Promises outside of node and without the need to promisify/denodify

Right, but this benchmark checks promise performance on the server. For the client, promise performance usually doesn't matter that much - rarely are 10000 promises even created at once or sequentially in a client app.

There are many places on the web pointing to this benchmark, and quite a few of them wonder how come Q is so slow

Q never attempted to actually be a "fast" implementation. There is so much low-hanging fruit there it's not even funny :) It's ok though, again, not the goal of the library. I think there is a comparison somewhere.

If you take an independent function as a lifter for the benchmark, then suddenly everything looks very different

But then it doesn't measure the actual performance of using the library in real code. The whole point of the benchmark was to simulate actual usage of the library and not be a synthetic microbenchmark. If you don't let libraries do the work you're not testing bluebird or Q, you're testing "bluebird-promisification+lifter".

The benchmark results are overall highly misleading, and more importantly they hide the fact that a major part of the score comes from the "extras" functions which some libraries provide, and from the wrappers (which the benchmark itself provides) for libraries that don't have them, while the performance of the actual Promise implementation is only a small part of the score.

I don't think it is. Even when you fix the benchmark - I don't think we should aim for synthetic stuff over actually testing the performance of real life code.

avih commented 9 years ago

Yes, it's not a very good thing - you are basically benchmarking when.js 4 times under 4 different labels :)

Finally we don't disagree ;)

My point still stands though. The benchmark benches denodeify more than anything else, and in cases where the library doesn't provide it, it benches its own wrapper implementation and then gives the score to the library.

This is bad practice for a benchmark, especially when it does so knowingly but doesn't advertise it anywhere.

benjamingr commented 9 years ago

I don't think you understand @avih - what Petka is pointing out is that when you use the WhenJS "lifter" you're creating a WhenJS promise which other libraries have to treat as a foreign thenable according to the spec, rather than taking the fast path.
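
To illustrate the difference in code (a rough sketch, assuming the when and bluebird packages are installed; the variable names are made up):

var when = require('when');
var Bluebird = require('bluebird');

// Foreign thenable: bluebird has to follow the generic spec assimilation path
// (call .then on the foreign object and schedule the callbacks accordingly).
var foreign = when.resolve(42);
Bluebird.resolve(foreign).then(function (v) { /* assimilated value */ });

// Native bluebird promise: the library can recognize its own type and
// short-circuit internally instead of treating it as a generic thenable.
var native = Bluebird.resolve(42);
Bluebird.resolve(native).then(function (v) { /* fast path */ });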

petkaantonov commented 9 years ago

My point still stands though. The benchmark benches denodeify more than anything else, and in cases where the library doesn't provide it, it benches its own wrapper implementation and then gives the score to the library.

No. I would even say that the wrapper implementation provided by the benchmark makes those libraries even faster than they would be in real code, because the library would force people to write their own wrappers, which are likely not as optimized.

avih commented 9 years ago

I don't think you understand @avih - what Petka is pointing out is that when you use the WhenJS "lifter" you're creating a WhenJS promise which other libraries have to treat as a foreign thenable according to the spec, rather than taking the fast path.

Agreed completely, and I understood exactly that even before that comment. And yet it does also test the libraries - now it tests how well they interop with another implementation according to the spec, and it tests the same thing for all libraries, and the libraries do come up with significantly different scores.

With the common lifter it benches a different thing - but it does so correctly.

No. I would even say that the wrapper implementation provided by the benchmark makes those libraries even faster than they would be in real code, because the library would force people to write their own wrappers, which are likely not as optimized.

If they need denodeify; it was already agreed that that's not a given.

petkaantonov commented 9 years ago

If they need denodeify; it was already agreed that that's not a given.

If you don't need denodeify then you are using an API which is essentially doing denodeify itself; it's still there, though, because promises don't magically appear out of nowhere - they always wrap a low-level API that uses callbacks in one way or another.
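
For example, even without a denodeify helper, the promise is typically created by hand-wrapping a callback API at the boundary, roughly like this (an illustrative sketch assuming a global Promise; not code from the benchmark):

var fs = require('fs');

// The promise is still produced by adapting a callback-based API at the edge.
function readFileAsync(path) {
    return new Promise(function (resolve, reject) {
        fs.readFile(path, 'utf8', function (err, data) {
            if (err) reject(err);
            else resolve(data);
        });
    });
}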

benjamingr commented 9 years ago

If they need denodeify; it was already agreed that that's not a given.

I'd like you to name one Node use case where you never need to promisify a library in your code (and performance matters).

and the libraries do come up with significantly different scores.

Of course, because now your benchmark is full of WhenJS promises. Of course when is going to come up on top if bluebird has to assimilate foreign thenables and When does not.

avih commented 9 years ago

I'd like you to name one Node use case where you never need to promisify a library in your code (and performance matters).

As I mentioned earlier, there's much use of Promises outside of node and without the need for denodeify/promisify, and even more cases where promisify is not on any hot path. Let's not loop the discussion.

Of course, because now your benchmark is full of WhenJS promises. Of course when is going to come up on top if bluebird has to assimilate foreign thenables and When does not.

I explicitly excluded when from the examples I posted and explicitly mentioned that for these three libraries, the when implementation is external and independent.

The point is equal grounds, and with the common lifter using when, obviously when itself is not on equal grounds with the others and is expected to show better results. It was an example of how equal grounds changes the game.

benjamingr commented 9 years ago

As I mentioned earlier, there's much use of Promises outside of node and without the need for denodeify/promisify ... Let's not loop the discussion.

I'm not looping the discussion - I'm trying to get an answer from you after you didn't answer last time. As for the "outside of node" part - no, there really isn't. I'd like you to name one use case of promises where you create north of 10000+ promises in a short time span outside of node.

I explicitly excluded when from the examples I posted and explicitly mentioned that for these three libraries, the when implementation is external and independent.

But you're testing the wrong thing (tm) - your benchmark now focuses on external assimilation, which misses the point and is not part of an actual workflow... Again - as Petka said, you probably want to focus on submitting PRs to libraries for fast promisification (or write NPM packages) if you want to see them perform that part of the benchmark quickly...

petkaantonov commented 9 years ago

Your point basically appears to be that promise assimilation performance is not that different between implementations. Yes I agree. I disagree that it has any relevance to most real world code, hence it's not a metric that is being benchmarked.

avih commented 9 years ago

Your point basically appears to be that promise assimilation performance is not that different between implementations.

I probably didn't explain myself clearly. With the common when lifter, the scores for the different libs (excluding when) vary between 600 ms and 6000 ms, and memory usage between 200 MB and 900 MB (my first post).

In this specific twisted (but correct) test, different libs perform very differently from each other.

Your point basically

Is that if the benchmark measures denodeify performance more than anything else, it should be advertised as such (and I agree it's a good thing to test), but it should not pretend, explicitly or implicitly, to measure the performance of what the specification defines as Promises.

domenic commented 9 years ago

@benjamingr

For the client promise performance usually doesn't matter that much - rarely are 10000 promises even created at once or sequentially in a client app.

We're starting to see this become false in Chrome 43+, with streams. E.g. take a look at the rate promises are created in https://domenic.github.io/streams-demo.

@petkaantonov

If you don't need denodeify then you are using API which is essentially doing denodeify themselves, it's still there though because promises don't magically appear out of nowhere, they always wrap a low level api that uses callbacks in one way or another.

False on the client, in general.


Overall I think that, if you step back from the way he's presented it and his big confusion about when/lifter, @avih has a point worth listening to. Namely that the benchmark does not sufficiently advertise that it's meant to guide your choices in Node-like environments. It's still a great benchmark.

But there's also value in a benchmark that would be more applicable in client-side environments, where indeed promises often do come from nowhere, and we're seeing big deltas between e.g. the slow V8 implementation and Bluebird. Although, it's hard to write such a benchmark with actual I/O involved, since e.g. streams only deal in native promises.

benjamingr commented 9 years ago

I definitely agree that the benchmark is only characteristic of a server-side workflow, Domenic. No argument there.

A benchmark for client-side environments would be nice. As for streams not exposing a hook for specifying a promise implementation, that could be handled on your end (for example, by allowing subclassing and overriding a protected getPromise method), but honestly I'm not sure if it's worth it.

petkaantonov commented 9 years ago

Namely that the benchmark does not sufficiently advertise that it's meant to guide your choices in Node-like environments

The benchmark is described as being taken from an article called Analysis of generators and other async patterns in node. I guess it could always be clearer that it's only relevant to server-side though.

avih commented 9 years ago

I'm not looping the discussion - I'm trying to get an answer from you after you didn't answer last time. As for the "outside of node" part - no, there really isn't. I'd like you to name one use case of promises where you create north of 10000+ promises in a short time span outside of node.

I haven't counted them, but Firefox (the web browser), for instance, uses Promises internally as one of the main means of asynchronous execution, and it completely replaces "normal" callbacks whenever code with callbacks is touched. And AFAIK it doesn't have any promisify-like usage in hot paths.

Of course, Firefox is not likely to choose an external Promises library to handle its internals, but that's a case where Promises are used extensively and in hot paths.

There's also use of Promises on the web to handle the page itself, especially in application-like web sites; that's another case where promisify-like functions are probably not in any hot path, but the performance of Promises is still important - both speed and memory use.

The benchmark is described as being taken from an article called Analysis of generators and other async patterns in node. I guess it could always be clearer that it's only relevant to server-side though.

Thanks, I don't think I've seen this reference in the past.

Also, again, I completely agree that benchmarking denodeify is very useful.

It's a shame IMO that most places on the web which mention this benchmark and compare performance of Promise implementations apparently fail to understand what this benchmark is mostly affected by.

petkaantonov commented 9 years ago

Thanks, I don't think I've seen this reference in the past.

It's in the benchmark section of the readme: https://github.com/petkaantonov/bluebird#1-doxbee-sequential . Anyway, I am working on a website with the 3.0 release which will contain the explanations and results on the same page, so this confusion will be lessened.

avih commented 9 years ago

Anyway, I am working on a website with the 3.0 release which will contain the explanations and results on the same page, so this confusion will be lessened.

I hope so too. Thanks.

If I find the time, I'll try to write another benchmark within the same framework which focuses on the performance of the core parts of Promises and tries to put all the libraries on grounds as equal as I can make them.

If I ever get to that, would you accept a PR for such a new benchmark?

benjamingr commented 9 years ago

Why would bluebird add more benchmarks? You can host them on your own GH.

It would be interesting for a benchmark to check client-side performance by the way.

petkaantonov commented 9 years ago

@avih If you can somehow demonstrate that it is relevant to real code and can actually be a performance problem realistically, then yes. E.g. @domenic said this:

We're starting to see this become false in Chrome 43+, with streams. E.g. take a look at the rate promises are created in https://domenic.github.io/streams-demo.

avih commented 9 years ago

Why would bluebird add more benchmarks? You can host them on your own GH.

That's always an option, but bluebird already has a framework (which I haven't looked into much, admittedly) and a setup which can compare different implementations with different kinds of workloads (at the very least parallel and doxbee), and I think it would be better for users to have them in one place instead of me setting one up on my own.

But anyway, I'm not there yet.

It would be interesting for a benchmark to check client-side performance by the way.

Agreed. I haven't tried browserify with BB, but it might work.

Also, there's this http://jsperf.com/promise-comparisons/154 , though I think it's currently broken, possibly due to CDN issues.

spion commented 9 years ago

Benchmarking was only one part of the analysis and I don't think I was aiming for a well designed framework when I put the original benchmark together. But I haven't looked at the BB version lately so @petkaantonov might have improved it significantly.

But yes, you're right - the benchmark was very much node focused as I was looking to replace callback hell with something else in doxbee (which uses node as a platform). A large part of that was definitely the ability to consume common node libraries.

I'm not really familiar with the client side use cases where promise performance matters so I don't think I'd be able to make a good benchmark for that. But it definitely sounds like a good idea.

avih commented 9 years ago

I'm not really familiar with the client side use cases where promise performance matters so I don't think I'd be able to make a good benchmark for that. But it definitely sounds like a good idea.

I think you might be surprised by how effective a very simple test can be. I happen to have some experience with testing the performance of code, and usually have a good hunch for what kind of code makes good tests (good = low noise level, responds well to implementation changes, etc.). Hint: the simplest tests are usually the most effective ones (as long as you have several simple tests where each one covers a different topic).

What I had in mind is similar to http://jsperf.com/promise-comparisons/154 , but within the BB framework which will also report memory usage, etc.

I'll try to give it a go at some stage.

benjamingr commented 9 years ago

No offense but that jsperf is a pretty terrible benchmark :D


avih commented 9 years ago

No offense but that jsperf is a pretty terrible benchmark :D

None taken, but also not a very productive comment :)

benjamingr commented 9 years ago

Yeah, sorry. On second thought I should have waited until I got to the computer and written a more constructive comment; I'm grumpy all day today :/

I think we should aim for benchmarks that simulate a real end-to-end client-side application making ajax requests and possibly operating streams and performing other asynchronous calls rather than just one aspect of the library.

Very short benchmarks (micro-benchmarks) have a tendency to measure unrealistic use cases and jsperf is infamous for benchmarking issues - I'm sure Petka and Gorgi both have a lot of insight into performance measurement that I don't and can possibly provide more accurate feedback.

I think we should consider something that perhaps simulates what the server-side benchmark works against (or something similar).


avih commented 9 years ago

I think we should aim for benchmarks that simulate a real end-to-end client-side application

Such a thing definitely has a place, but IMO it would be hard to come up with something which "simulates an end-to-end client app", because there are so very many different usage patterns. Personally I don't think I can pretend to be able to do that and be happy about the result.

OTOH, measuring strict and well defined aspects of the core performance is something I think I can do and end up happy with.

The benefit of such a minimalist approach is that whoever looks at the results can extrapolate to their own use pattern relatively easily, while the other way around (a complex test from which one needs to deduce whether the library fits their use pattern) is much harder to use effectively.

But if there are concerns with such approach, I'm happy to discuss them. The goal is to come up with something useful at what it does.

benjamingr commented 9 years ago

The benefit of such a minimalist approach is that whoever looks at the results can extrapolate to their own use pattern relatively easily, while the other way around (a complex test from which one needs to deduce whether the library fits their use pattern) is much harder to use effectively.

The problem is that it's hard to write good benchmarks and many get them wrong. It's hard to get them right and even when you do - people usually don't understand them very well.

I agree that creating an "end-to-end" client side app is challenging but there are several characteristics that they usually have in common - making ajax requests to the server for example.

avih commented 9 years ago

The problem is that it's hard to write good benchmarks and many get them wrong. It's hard to get them right and even when you do - people usually don't understand them very well.

Agreed. It still doesn't make them useless. A benchmark has to be evaluated by peers in order to be a good test, and has to be explained to users so they understand what the scores mean. Not unlike the bluebird benchmark.

And still, I think the smaller benchmarks have very good value, and have the excellent quality of being obvious about what they measure.

I agree that creating an "end-to-end" client side app is challenging but there are several characteristics that they usually have in common - making ajax requests to the server for example.

Indeed, but ajax is likely IMO to completely throttle the benchmark, which kind of makes it ineffective. OTOH, if the server runs on the same machine as the client, or even within the same node environment, then it could end up effective. Experiments could answer this question better than my hypotheses.

It's not my thing to write, but I'd definitely be interested to see what it produces.

benjamingr commented 9 years ago

Indeed, but ajax is likely IMO to completely throttle the benchmark, which kind of makes it ineffective.

Yes, because promise performance isn't really that important on the client side for that typical use case. That was my initial point. If you can think of a client use case that does require fast promises (maybe with streams) then I'm definitely interested in performance results :)

And still, I think the smaller benchmarks have very good value, and have the excellent quality of being obvious about what they measure.

I think that when it comes to V8 there are very few people who can actually understand what a performance benchmark means - I think @petkaantonov is probably one of those people. The average or even advanced JS developer sees these benchmarks and comes up with conclusions like "I should use .join("") instead of adding strings", which are usually irrelevant to anything but the very specific case where doing exactly that is the benchmark.

Fast code is written by profiling applications and re-writing specific parts.

If I tell you now that allocating a bluebird promise is cheaper than allocating an RSVP one - would you use it in your app? What about if I told you that .then assimilates foreign thenables faster in when than in kew - does it even matter for your app? What if I told you that Promise.resolve is 10 times faster in Q than in any other promise library - would that be a considerable advantage to use Q, in your opinion?

In practice, I've never ever seen a promise bottleneck on the client side. What Domenic said about streams might change that, and I'm definitely interested in seeing that.

The reason bluebird has a benchmarks section in the first place is not to brag that it's the fastest (though it indeed is) - it's in order to convince people that promises do not cause considerable overhead and that they're a viable alternative to callbacks in their applications. They add promises when they see they're about as fast as callbacks and then stay for awesome things like unhandled rejection detection, long stack traces and so on.

petkaantonov commented 9 years ago

Indeed, but ajax is likely IMO to completely throttle the benchmark, which kind of makes it ineffective.

Well, no shit.

At best I could believe an application making 60 requests per second to the server, and even with complicated promise chains that would only be e.g. 600 promises created per second. That's way too few promises being created for the implementation's performance to matter at all. Even if you swapped from Q to bluebird to get a 100x improvement, the app would still perform the same.

That is why there are no client-side benchmarks. You need to come up with a plausible scenario where you for some reason need to create e.g. 500,000 promises per second, and there is none.

avih commented 9 years ago

Well, no shit.

Hey, I didn't suggest this approach. I just had the same opinion as you, with the exception that I wasn't as conclusive before actually experimenting with such a benchmark approach.

That is why there are no client-side benchmarks. You need to come up with a plausible scenario where you for some reason need to create e.g. 500,000 promises per second, and there is none.

This is probably incorrect, and I can guess at least one pattern where it would matter: using promises in calculation-intensive tasks. I don't think I've bumped into such a use case though, but OTOH I also wasn't looking.

avih commented 9 years ago

If I tell you now that allocating a bluebird promise is cheaper than allocating an RSVP one - would you use it in your app? ...

Not on its own, but with several such results for different core aspects it would definitely tell which implementation does which things better than other implementations, and I could definitely make a wiser decision when choosing an implementation to use in my own scenario.

OTOH, with a benchmark which tries to cover everything and the kitchen sink, it would only help me choose an implementation if my use case has exactly the same pattern as the benchmark, and if I understood exactly what this pattern is.

domenic commented 9 years ago

I agree microbenchmarks are useless and a realistic end-to-end test is the only possible valuable output of this conversation.

With regard to client scenarios where these things matter: it's important to remember that you have 16 ms per frame if you want to avoid jank. If creating/manipulating the appropriate number of promises takes longer than that, promise performance has started to matter.

Another place we ran into this recently was in trying to use promises for asynchronous layout APIs (which literally would run every frame). You can see some results here. As you can see, in some moderately complex cases Bluebird was able to complete the task in less than 1 millisecond, while V8 promises took over 16 ms (a complete failure).

So, 500K promises per second isn't exactly the only scenario.
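
To make the frame-budget point concrete, the kind of measurement being discussed might look roughly like the sketch below (the helper name and approach are illustrative and not taken from the linked demo):

// Resolve `count` chained promises and report how long it took relative
// to a ~16 ms frame budget.
function measureChain(PromiseCtor, count) {
    var start = Date.now();
    var p = PromiseCtor.resolve(0);
    for (var i = 0; i < count; i++) {
        p = p.then(function (v) { return v + 1; });
    }
    return p.then(function () {
        var elapsed = Date.now() - start;
        // If `elapsed` exceeds ~16 ms for a realistic count, promise
        // overhead alone is enough to drop frames.
        return elapsed;
    });
}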

avih commented 9 years ago

With regard to client scenarios where these things matter: it's important to remember that you have 16 ms per frame if you want to avoid jank. If creating/manipulating the appropriate number of promises takes longer than that, promise performance has started to matter.

What I had in mind actually included reference to exactly that: how much the implementation blocks the main thread. There's an obvious tradeoff between speed and responsiveness, and where an implementation stands within this range is a very useful thing to know indeed.

petkaantonov commented 9 years ago

If I tell you now that allocating a bluebird promise is cheaper than allocating an RSVP one - would you use it in your app? ...

It would definitely tell which implementation does which things better than other implementations, and with several such results for different core aspects I could definitely make a wiser choice when choosing an implementation to use in my own scenario.

What if I told you that it could be a trade-off for improving real-world performance? This is exactly the kind of thing that microbenchmarks miss.

This reminds me of a case where IE10 (or 11) performed 100x faster than other browsers in one sunspider microbenchmark. It turned out that IE's dead-code elimination step got completely rid of the stupid benchmark and it did no actual work. So real world code is paying for a more thorough dead-code elimination step that will likely not find any dead-code at all but broken benchmarks are showing a "massive improvement". Is that a good trade-off?

avih commented 9 years ago

It turned out that IE's dead-code elimination step got completely rid of the stupid benchmark

Obviously the implementation should not be rigged to handle the benchmark, but otherwise it's a very good point IMO that some optimizations can completely bypass the benchmark.

With that fine point in mind, it would be quite easy to make sure the benchmark avoids such pitfalls, by making sure that the benchmark verifies against an expected result which couldn't be reached by means other than going through the designed benchmark code flow - or by rigging the implementation.
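
For example, a self-verifying benchmark kernel could look roughly like this (names invented, illustrative only):

// Chain `n` increments and verify the final value, so dead-code elimination
// cannot discard the benchmarked work without changing the observable result.
function benchChain(PromiseCtor, n) {
    var p = PromiseCtor.resolve(0);
    for (var i = 0; i < n; i++) {
        p = p.then(function (v) { return v + 1; });
    }
    return p.then(function (sum) {
        if (sum !== n) throw new Error('unexpected result: ' + sum);
        return sum;
    });
}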

petkaantonov commented 9 years ago

I think you missed the point. I meant that you can completely optimize an implementation with micro benchmarks in mind and have completely opposite real world performance. So you cannot really extrapolate anything from microbenchmarks. You can use them to double check your local optimizations internally, but using them to extrapolate big picture results is just not possible.

avih commented 9 years ago

Anyway, this has become quite off topic.

The main topic was that this benchmark is referenced in many places as a good tool to compare Promise implementations, but the fact that it tries to measure mostly server/node scenarios, and especially that the score depends greatly on the implementation of denodeify, is ignored/missed more often than not, so it should probably get more advertising.

Hopefully the 3.0 web site would address this.

avih commented 9 years ago

So you cannot really extrapolate anything from microbenchmarks

I wouldn't go as far as categorically saying this, and there's a great deal of range below "full app-like scenario" which the benchmark can be positioned at.

That depends on what the benchmark does, which should obviously be open for criticism - once there's something to criticize.

spion commented 9 years ago

@avih what I meant was that I'm not very familiar with real usage scenarios where client-side promise performance matters, so I won't know how to write a benchmark that simulates them well. Not saying that there aren't any (the use cases that you and @domenic described are interesting), but I imagine there would be a lot more work there (e.g. which implementations avoid putting pressure on the GC to prevent too many frame-rate skips when the GC collects - how do you measure that? :D)

In the context of node I think that including denodeify is fair. It would be misleading to publish great numbers for a library that doesn't provide one. Users would most likely go ahead and write a naive implementation, still expecting great performance. Then they'd wonder why they are not quite getting it.

avih commented 9 years ago

In the context of node I think that including denodeify is fair. It would be misleading to publish great numbers for a library that doesn't provide one. Users would most likely go ahead and write a naive implementation, still expecting great performance. Then they'd wonder why they are not quite getting it.

Agreed with all of the above - as long as it's clear what the scores represent and which factors affect the score the most. Otherwise, if it's unclear or assumed incorrectly (as I believe happens more often than not with people who are unrelated to this project), then it could be highly misleading.

And it was already stated that this issue would hopefully be addressed in the next version of the website.