Closed: dsyme closed this pull request 9 years ago
(I removed the bit about avoiding tailcalls since it doesn't make things any faster)
The performance on the streamTest() in StreamsTest.fs after these changes is:
// Real: 00:00:02.384, CPU: 00:00:02.375, GC gen0: 0, gen1: 0, gen2: 0
// Real: 00:00:02.490, CPU: 00:00:02.484, GC gen0: 0, gen1: 0, gen2: 0
// Real: 00:00:02.362, CPU: 00:00:02.343, GC gen0: 0, gen1: 0, gen2: 0
The performance of 0.3.0 is comparable (actually a bit slower):
// Real: 00:00:03.352, CPU: 00:00:03.343, GC gen0: 0, gen1: 0, gen2: 0
// Real: 00:00:02.631, CPU: 00:00:02.640, GC gen0: 0, gen1: 0, gen2: 0
// Real: 00:00:02.858, CPU: 00:00:02.843, GC gen0: 0, gen1: 0, gen2: 0
// Real: 00:00:03.156, CPU: 00:00:03.156, GC gen0: 0, gen1: 0, gen2: 0
So it is a little bit slower, by roughly 1 second?
Btw I'm doing research with @biboudis on multi-stage programming and stream fusion in MetaOcaml and Scala LMS and we are very positive that we can port some of our research ideas to F#.
@palladin - It's actually a bit faster in this configuration, with the tailcalls removed, see https://github.com/nessos/Streams/pull/31#issuecomment-111158738. It seems sensitive when the leaf functions are simply +/- operations. In any case it would be good to try to verify that I haven't horked perf.
@palladin - How do we get @biboudis to move his research to F#? Then we could all cooperate very easily :)
@dsyme It is very easy... MSR funding :)
@dsyme Is it possible to check the performance with the 64-bit JIT? (I've seen huge differences between 32-bit and 64-bit.)
And I'm pretty sure the `let inline` definitions are critical for performance under the 64-bit JIT.
Yes, I've been using 64-bit. The inlining is still carefully orchestrated to boil things down to `mapCont` and `iter` in a way that preserves the existing inlining of function calls. But you should verify perf.
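For readers following along, here is a minimal, hypothetical sketch (not the library's actual definitions) of the push-based shape being discussed: a stream is a function over a consumer, `map` merely wraps that consumer, and `inline` lets the composed pipeline boil down to a single loop over the source.

```fsharp
// Hypothetical push-based stream: a function that feeds elements to a consumer.
type Stream<'T> = ('T -> unit) -> unit

let inline ofArray (source: 'T[]) : Stream<'T> =
    fun k -> for i = 0 to source.Length - 1 do k source.[i]

// map only composes continuations; with inlining, no intermediate
// collection is ever materialized.
let inline map (f: 'T -> 'R) (stream: Stream<'T>) : Stream<'R> =
    fun k -> stream (fun x -> k (f x))

let inline iter (f: 'T -> unit) (stream: Stream<'T>) : unit =
    stream f

// Usage: two maps fuse into one traversal of the array.
let mutable total = 0
[| 1; 2; 3; 4 |]
|> ofArray
|> map (fun x -> x + 1)
|> map (fun x -> x * 2)
|> iter (fun x -> total <- total + x)
// total = (2 + 3 + 4 + 5) * 2 = 28
```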
Research through osmosis of ideas may lead us to F# very quickly :-)
The diff between the "fix iterator()" ( #30 ) and "hid representation" ( #31 ) PRs is here: https://github.com/dsyme/Streams/compare/fix-iterator...dsyme:hide-reprs
@biboudis But I want to osmote too :) . BTW the reason I'm doing these PRs is to finally deeply understand what you guys have been doing :)
@dsyme Based on my performance assessment everything looks ok!
@palladin BTW at some point I also hope to look at hiding the representation details for ParStream. I understand that might be a bit harder :) Do you have a feeling for whether the inlining is less crucial there? And do you have a feeling for whether existing consumers of ParStream take advantage of the representation details, e.g. ParIterator, Collector, SourceType and the abstract members on ParStream?
Based on my experience, inlining is equally important in ParStream, because ParStream and PSeq are designed for CPU-bound multicore (throughput) scenarios.
One important fact is also that CloudFlow is using the ParIterator internally.
I've looked into what it would take to internalize the representations of streams while keeping performance where code is well inlined and at least as much fusing occurs as today.
The aim is to put this library on a sustainable binary-compatible basis while keeping performance.
The PR is an extension of #30, which should be merged first; once that is merged, the diff below will shrink.
The formulation I've used below (`nocurry()`) seems to work.
It would be very cool if we could somehow find a way to fully fuse map --> map --> map --> map chains. However, I haven't been able to do that, and I don't think it's possible with F# 4.0's optimizer, even when we fully leak all representations.
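For illustration (using plain `Array.map` rather than the library's combinators), "fully fused" would mean the optimizer collapses the per-element closures of a map chain into one arithmetic expression, equivalent to what one would write by hand:

```fsharp
// A chain of three maps: each stage introduces its own closure, and
// without full fusion some closure-call overhead per element remains.
let unfused (xs: int[]) =
    xs
    |> Array.map (fun x -> x + 1)
    |> Array.map (fun x -> x * 2)
    |> Array.map (fun x -> x - 3)

// The hand-fused equivalent the optimizer would ideally produce:
// one closure, one pass, no intermediate work per stage.
let fused (xs: int[]) =
    xs |> Array.map (fun x -> (x + 1) * 2 - 3)
```

Both produce the same results; the open question in the thread is whether the compiler can be made to perform this rewrite automatically.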