scala / scala-library-next

backwards-binary-compatible Scala standard library additions
Apache License 2.0
69 stars 17 forks source link

Add a bytestring library containing a rope-like immutable data structure as a Scala module #137

Open mdedetrich opened 2 years ago

mdedetrich commented 2 years ago

The Akka codebase implemented ByteString which is a highly optimized datastructure for dealing with strings that are backed by bytes in an immutable rope like data structure. There is a strong argument for having this as a self contained community module since the problem that ByteString is solving is a common one that is experienced in many other Scala projects particularly webservers (and bringing in entirety of Akka just for ByteString is overkill). Having it as a Scala module can also mean it can be extended for other Scala platforms (Scala.js/scala-native) which I am planning to do if its accepted.

There is a sample project at https://github.com/mdedetrich/bytestring which already contains a general outline of the module. The project also contains an AUTHORS which details the notable past committers of the contained source files. The ByteString has been placed into the scala.collection.immutable package (one can also add a type alias in scala.lang.ByteString to scala.collection.immutable.ByteString but I think that is a bit aggressive since scala.lang is meant to be a sanctioned space for the Scala compiler).

In terms of stability, the ByteString datastructure has been contained within Akka in a binary compatible manner for a long period of time. It has also been optimized over the same period of time for one of the most performance sensitive projects in Scala and there is a benchmarking suite also to verify this. Support for the ByteString has also been extended to older Scala versions (my personal stance on this is that as long as its not an inconvenience to support older versions, for a Scala module there may be some benefit to this but removing support for older Scala versions when it becomes problematic is definitely desirable).

SethTisue commented 2 years ago

Would the plan to be eventually make it part of the Scala standard library, once that becomes open for additions again?

I want to note that we do already have two repos for things that are considered candidates for that:

But scala-collection-contrib is intentionally a grab bag with a liberal merge policy, so probably it's not a suitable library for lots of people to have on their classpaths.

scala-library-next might be more suitable, as the merge policy is stringent. (Stringent enough that very little has actually been merged there, to date, though not that many people have even tried.)

SethTisue commented 2 years ago

Another possible home for this is the Scala Center's rebooted successor to the old, abandoned Scala Platform idea. I'm not sure how far along that currently is.

SethTisue commented 2 years ago

As for the "make it a Scala module" idea:

It has never been especially clear what qualifies, or should or could qualify, as a "Scala module".

The most common thing that it has meant to date is "thing that used to be in the standard library, but we took it out" (scala-xml, scala-parser-combinators, scala-parallel-collections, scala-swing).

We also have scala-async, which is an example of something that is maintained by the Scala team at Lightbend and seems language-adjacent and standard-library-adjacent but whose future is cloudy. (Its development and maintenance have been commercially sponsored, and it's not clear that development and maintenance would continue if that funding dried up.)

And then there's scala-java8-compat and scala-collection-compat, which are pretty clearly standard-library-adjacent but seemed to make more sense as separate libraries with their own release schedules, rather than rolling them into stdlib.

All of this stuff dates back to an era when the Scala Center didn't exist yet, Scala 3.0 hadn't been released yet, and Typesafe/Lightbend's role in managing the overall future of Scala was larger. These days, the Lightbend team is more clearly a Scala 2 compiler/language team. These days I'd expect any up/down decision on something becoming a "Scala module" would ultimately be the Scala Center's.

cc @julienrf

mdedetrich commented 2 years ago

Would the plan to be eventually make it part of the Scala standard library, once that becomes open for additions again?

Sure we can do that, I am not against it. I just suggested it as a Scala module because I assume its easier to get it in but I would imagine even if it does land in official Scala standard library it still makes sense to have it as a module for the older Scala versions.

I want to note that we do already have two repos for things that are considered candidates for that:

https://github.com/scala/scala-collection-contrib https://github.com/scala/scala-library-next

But scala-collection-contrib is intentionally a grab bag with a liberal merge policy, so probably it's not a suitable library for lots of people to have on their classpaths.

scala-library-next might be more suitable, as the merge policy is stringent. (Stringent enough that very little has actually been merged there, to date, though not that many people have even tried.)

In general I would defer to whatever makes most sense. The intent is to have a high quality community maintained data structure that is official to some degree and whether it makes sense as a module or a module for older versions in Scala + Scala stdlib or any combination of this.

mdedetrich commented 2 years ago

Also on another note, I think that https://github.com/scala/scala-library-next/issues/138 would make more sense as a scala module where as ByteString has a stronger argument for being included into Scala stdlib (just due to the nature of whats being added). Also pinging @Ichoran to chime in if possible since you have experience Scala collections.

If there is merit in putting it into the Scala stdlib I can also make a pull request that does this, but I assume it would only be for Scala3 because Scala2 doesn't receive major additions?

Ichoran commented 2 years ago

I in general support the idea of having ByteString-like functionality in the standard library, but I don't have time to look through the implementation in detail right now. But it's at least as commonly-useful of a thing as PriorityQueue, so I think there's a pretty good argument to include it.

Regarding speed, a quick glance suggests that there may still be low-hanging fruit to speed things up. For instance, apparently a Java vararg Int method is commonly-enough used to be worth specifying, but it defers to an implementation using an iterator and IntIsIntegral instead of a direct array creation and a tight loop doing toByte on the input array. But, as I said, I don't have time to look in detail; that just leapt out at me while I was glancing through the code.

The collections have never been particularly minimal, so something very much like this is a good candidate for inclusion, I think. Whether it's sufficiently generic and collections-like I don't presently have time to judge.

Note that Scala does not presently have good ways to deal efficiently with collections of primitives. We should also probably consider whether this should be remedied next time we can change the core library, and if so, whether ByteString would belong better as a PrimitiveCollection than a regular one. (I would judge: yes.)

The new features in Scala 3 give us a lot more options for making fast and flexible primitive collections (transparent inline dispatch to optimized routines, especially). Then again, if Akka is going to support 2.x for a long time, it wouldn't really do for ByteString to change in that way.

mdedetrich commented 2 years ago

Scala contributors thread created at https://contributors.scala-lang.org/t/adding-akkas-bytestring-as-a-scala-module-and-or-stdlib/5967

som-snytt commented 2 years ago

the merge policy is stringent

or bytestringent

If there is a "platform" initiative, I hope it means a module such as this one can receive blessed status (easy to fetch as part of a bundle of some kind) but remain decoupled for purposes of releases.

Maybe a blessed module gets a signal boost when it needs a maintainer.

The counter-argument is that the onus is on the volunteer to publish or perish. Then future upstream contribution to standard lib should be "the easiest decision in the history of decisions," as a current ad campaign has it, instead of a matter of research or merge policy.

library-next is where PR stands for "parked research".

mdedetrich commented 2 years ago

On the note of release strategy, if ByteString gets released as a Scala module then I assume its possible for maintainers to make a new release when desired (I also assume this means external community maintainers which other Scala modules have).

mdedetrich commented 2 years ago

As a sidenote, @ansvonwa from https://contributors.scala-lang.org/t/adding-akkas-bytestring-as-a-scala-module-and-or-stdlib/5967/7 suggested an alternative implementation which is meant to fix quite a few shortcomings of the current implementation of ByteString (largely due to the fact that the current implementation is not a pure rope datastructure).

I personally have no qualms against this, question is just logistics (i.e. should it be implemented as a PR against https://github.com/mdedetrich/bytestring just to make sure its source compatible and reuse the existing sbt-jmh benchmarks or whether to do it separately).

ansvonwa commented 2 years ago

I think it would make sense to implement the TreeSeq (or however we call that superclass) in the scala repo, make sure it works for Vector and is fast and then use it for ByteString before merging, so we know if something is wrong. One question that came to my mind: If ByteString extends generic scala collection classes, do we get any (performance) problems with boxing/unboxing?

Btw, as a test, I've implemented a VectorTree tonight. There is no TreeSeq yet, but about half of the methods could be moved there when adding it.

mdedetrich commented 2 years ago

So the only thing that comes up here is the best method for supporting the currently maintained Scala2 versions (2.12 and 2.13). If we are extending a currently existing class, this means we would have to also add it to the Scala 2.12 branch. @lrytz Just responded on scala contributors at https://contributors.scala-lang.org/t/adding-akkas-bytestring-as-a-scala-module-and-or-stdlib/5967/9 and apparently Scala stdlib needs to be forwards and backwards compatible which means adding in a VectorTree would work (assuming its private) however ByteString/TreeSeq would have to be in a separate module since they are public (meaning that if you are not running ByteString either in latest Scala 2.13 or Scala3 it would have worse performance since it will use the older Vector). @SethTisue Do you know if adding an equivalent VectorTree change in Scala 2.12 branch is out of the question?

In summary at least for our use cases (and others) we would need ByteString to work with Scala 2.12, 2.13 and 3.

One question that came to my mind: If ByteString extends generic scala collection classes, do we get any (performance) problems with boxing/unboxing?

Cannot answer this one but I assume for a collection going into collection.immutable it needs to be collection like in style/consistency so we may have to accept the boxing problem unless its specialised over primitive types.

Ichoran commented 2 years ago

If we're trying tweaks to Vector, someone should go back and re-examine the work done when Vector was last redone for 2.13. A variety of implementations were examined, including some that were pretty high performance. Also, the work around an ordered immutable map is relevant.

The way to get around boxing issues is, for now, to have custom high-performance methods, as is the case for, say, LongMap. You can also encourage people to use the Stepper interface which is primitive-specialized for Java interop.

The whole thing needs to be rethought for Scala 3 at some point in the future.

It is also my intuition that ByteString should not be built on top of anything. The whole point of having it is for speed. Building it on top of anything is liable to impair performance. You might re-use an implementation, but things like having the actual chars in short arrays with the data structure saying how big the children are to left and right is really important for a high-performance rope-like string, and nothing else needs exactly that.

mdedetrich commented 2 years ago

Yeah now that I think about it, this makes more sense. Also makes it easier to release ByteString as a Scala module (given the fact that we can't add it as a Scala collection in stdlib right now apart from Scala3)

ansvonwa commented 2 years ago

A variety of implementations were examined, including some that were pretty high performance.

@Ichoran, do you have any links for that? Sounds really interesting, also for comparison. I only found collections-strawman which looks suspiciously similar to the pre-2.13 Vector.

The way to get around boxing issues is, for now, to have custom high-performance methods, as is the case for, say, LongMap.

But LongMap extends AbstractMap[Long, T]. Isn't that a problem? Or is that only problematic if the type at call site is not known to be LongMap (as in (longMap: AbstractMap[Long, T]).apply(1L))?

Ichoran commented 2 years ago

@ansvonwa - I believe I was thinking of https://github.com/scala/scala/pull/7146

SethTisue commented 2 years ago

These days, the Lightbend team is more clearly a Scala 2 compiler/language team. These days I'd expect any up/down decision on something becoming a "Scala module" would ultimately be the Scala Center's.

And on that ground, I'm going to transfer this out of scala-dev and over to scala-library-next.