Define steps to reach a Go / No-Go decision on adopting this strawman

Ichoran commented 7 years ago

This proposal began as a strawman, a possible new design for collections. But we're starting to work on it pretty seriously. To avoid wasting our effort, we should define those things we need to check before we can decide whether the strategy is a success and we should attempt to replace the original collections library with this one. (We then have another go/no-go decision about whether we did well enough for us to be able to do that.)

I think we've already demonstrated these things adequately:

Superficially identical syntax for common operations
A reasonable hierarchy exists for doing things this way (may need tweaking, but it won't just fail)
Specialization can be added in where we want it
Both mutable and immutable base collections behave sensibly
Wrapping arrays as a strawman-collection works fine

Maybe I've just missed it but if not we should come up with a complete list of the things we want to check before we dive in with full effort. These occur to me right now (assuming I haven't missed any of the work which has already been done, which is possible):

Can we support String operations?
Can we implement Map elegantly? Are maps regular collections, or their own thing which gets boxed as a collection?
How do we handle things like BitSet#map and TreeMap#map?
Are there any hitches for weird collection types like PriorityQueue?

In particular, we should try to find the places where CBF is really pulling its weight and see what we can do in this system to copy the behavior, or at least do something sound.

odersky commented 7 years ago

Can we support String operations?

String is already supported (see javaSupport.scala).

Can we implement Map elegantly? Are maps regular collections, or their own thing which gets boxed as a collection?

I think Maps need to be Iterables of pairs, as before. But we should indeed do the exercise to add them.

How do we handle things like BitSet#map and TreeMap#map?

I would expect the same way as String/Array. That is, overload the map operation to have a fully generic one which returns a Set or a Map and a constrained one which returns a BitSet or a TreeMap. But yes, we should try that out.

Are there any hitches for weird collection types like PriorityQueue?

Yes, good question.

I think next step should be to add sets and maps with sepcial focus on irrgular cases such as TreeMap. Then we will see clearer. Who wants to get started on this?

odersky commented 7 years ago

I think another important element for a decision is to do some preliminary benchmarking. If we are in the ballpark of current collections that part would be a go, otherwise we'd have to revise.

We don't need to micro-optimize everything yet, just do a decent job on the critical parts that are exercised by the benchmarks.

lihaoyi commented 7 years ago

Superficially identical syntax for common operations

I think most people would agree on this, but is it worth defining exactly what common means? Either with a hardcoded list of Type#method pairs, or a numeric threshold based on # of usages (whether static or dynamic) in the community build? Or something else?

Ichoran commented 7 years ago

That's a good point, @lihaoyi. I was working on an ASM-based bytecode scanning tool (so we could scan jars without caring whether we could compile them, whether classes were compatible, whether they were Scala or Java, etc.) to gather statistics like these. I should resurrect it, or we should find something like it, and use it to assist defining and detecting what is common.

odersky commented 7 years ago

@olafurpg has a comprehensive set of projects which he can scan with scala meta. Right now the supported queries are mostly syntactic, however.

odersky commented 7 years ago

It's an interesting question how to measure commonality of operations. One obvious (and good!) measure is community build and other common codebases. But that's clearly more oriented towards library providers than library users. And in a sense the most important thing is not to upset the end-users - providers can and will adapt but in particular casual end users will not have the same amount of patience.

I thought one other measure could be usages in tutorials on Scala and related technologies. If we break those, we will confuse past and future users. The problem is that not all tutorials are freely accessible. But quite a few of them are: for instance our Coursera courses (with 600K+ registered users), our "Programming in Scala" book and quite a few other books, or Twitter's Scala school. It would be really cool if we could compile an index of these materials as well.

lihaoyi commented 7 years ago

And in a sense the most important thing is not to upset the end-users - providers can and will adapt but in particular casual end users will not have the same amount of patience.

Should we focus on code already-written by end users, instead of tutorials targeting new users? Breaking books and tutorials means they will be out of date, but as you said tutorial-writers are "providers" rather than end-users and can release new versions of their books or tutorials, just as library-providers can release new versions of their libraries.

Almost by definition, the new users don't know anything yet, and if we change the language and tutorials the endless stream of new users will learn the new thing.

On the other hand, if someone's maintaining a million-LOC Scala codebase, that would actually be a pain to migrate depending on how breakage occurs. We might not be able to persuade a large company to give us a dump of their massive codebase, but if someone wrote a simple code-analysis tool that plugged into the compiler or ran over bytecodes to dump a txt file containing aggregate statistics of calls to scala.collection.* methods I'd bet we could persuade some real companies to run it on their codebase.

Such aggregate statistics shouldn't contain any sensitive information or IP (who cares if your company uses List#scanRight a lot???) and a company may be willing to make them available purely out of self interest: nudging the 2.13 decision-making process to reduce the amount of pain they, the end users, will see upgrading when 2.13 lands. Which seems to be what we want as well.

odersky commented 7 years ago

@lihaoyi I was focusing on tutorial writers for two reasons:

tutorials tend to be longer-lived than libraries, with a slower rate of innovation.
tutorials influence the coding patterns of people learning with them. That way, they are a good proxy for the large number of programmers who use Scala but are not contributors to a highly visible software package.

That said, I like the statistics tool idea.

olafurpg commented 7 years ago

I made a small experiment to find out what parts of the collections library is used in open-sources Scala libraries.My methodology:

Pick top 1.000 Scala projects according to scaladex.
- Here are the top 10 libraries, use pagination to see more results.
- The ranking takes several factors into account, including a simplified "page rank" (how many artifacts depend on this library, even for test dependencies) and Github stars/contributors (if available).
Download the latest version of all published artifacts for those projects.
- 4.298 jar files, 3.8 GB
- 819.035 classfiles
For each classfile, use javap -c and grep for "scala/collection"

Early results:

1.046.449 total results, this includes new, invoke{virtual,special,interface}, checkcast
692.361 invocations
75.709 invocations to scala/collection/mutable/StringBuilder.append
1.720 invocations to methods named view, of which ~900 are to scala/collection/SeqLike$class.view:(Lscala/collection/SeqLike;)Lscala/collection/SeqView; and scala/collection/IterableLike$class.view:(Lscala/collection/IterableLike;II)Lscala/collection/IterableView;
25.093 calls to .map
6.240 calls to .flatMap
4.242 calls to .filter
3.734 calls to .withFilter
2.578 calls to .headOption
31.262 calls to .head 😭

The dataset is available here, feel free to grep/sed/awk as you please: https://drive.google.com/file/d/0B5BBsTPBxGcMdl9zZG1sSnZ6Qlk/view?usp=sharing

olafurpg commented 7 years ago

The maven coordinates for the analyzed artifacts are here: artifacts.zip Some of them are invalid and don't resolve to actual published artifacts. The file is autogenerated from json responses from scaladex.

odersky commented 7 years ago

Interesting. I was surprised that view is that common. Definitely good that we cater for it instead of removing it.

What about breakOut? That's one of the things we cannot support anymore.

Ichoran commented 7 years ago

Thanks for gathering this, @olafurpg -- very instructive!

@odersky - 270 uses of breakOut according to a quick grep. This is in the general range of mapResult (on builders; 396 usages) and enqueue (from Queue, 244 usages).

fommil commented 7 years ago

I've found breakout to be incredibly useful for fixing performance issues related to object churn.

Almost every call of mapValues (returns a view, or at least a lazy thing) has been an accident in my experience with huge performance penalty.

fommil commented 7 years ago

btw, please add ensime to your test list. It's a big codebase by github standards and won't show up in scaladex because it's an app (not a library) and hence not externally referenced.

olafurpg commented 7 years ago

I extended the dataset to include info about the origin of each call-site. This makes it possible to see where these points of interest appear. Here's the data in a tab-separated format with columns 1. project (first 2 package names) 2. classfile name 3. scala.collection callsite:
result5.zip

Here's a visualization for the usage breakOut grouped by project: https://docs.google.com/spreadsheets/d/14jkp9wpVJXD2kPKZWM1vWbc1gt8HRE5NCBJbREbrfwQ/edit?usp=sharing

The spreadsheet is open for anyone to edit, feel free to add your own analysis in a separate sheet.

@fommil Ensime is in the list and appears to be the fourth biggest user of breakOut

odersky commented 7 years ago

@fommil The strawman proposal can easily obtain the performance gains obtained from breakout by explicit view/fromIterable calls. But it will require a rewrite of every occurrence of breakOut. So this is not so much a matter of principle but of degree. How widespread will be the burden of rewrites?

fommil commented 7 years ago

We probably overuse breakOut to be honest (it's come up so much in performance profiling that I now add it by instinct when I'm coding something that will be called a lot). I know one or two of our files that are heavy churn creators (anything around classfile indexing) so I can just profile them again under the new collections lib and do whatever is necessary.

In terms of proprietary usage of breakOut, I have tended to be a bit more surgical about it, so hopefully a grep and a cookbook is all that is needed.

ENSIME might struggle to create a <=2.12 and 2.13+ version of the code so maybe when we support scala 2.13 we will have to drop support for older scala. That's ok, we want to freeze the API in 2.0 anyway in the hopes of dropping 2.10.

szeiger commented 7 years ago

At the Scala team meeting last Friday we discussed the option of having a compatibility library for cross-compilation between 2.12 and 2.13 collections. This could not possibly be a one-step process (making 2.12 collections looks like 2.13 or vice versa) because of shared names (like view) with different semantics but possibly get close to 2.13 semantics with different names on top of 2.12 and an aliasing shim for 2.13. This is something we could investigate in Q2 when we should also work on automatic rewrites with Scalafix and testing of some community libraries (e.g. ScalaTest) with the new library types before porting all implementations.

szeiger commented 7 years ago

@odersky I actually find 1720 view calls reassuring. That's less than 2 calls per project on average.

olafurpg commented 7 years ago

Here is the the usage of view grouped by project https://docs.google.com/spreadsheets/d/14jkp9wpVJXD2kPKZWM1vWbc1gt8HRE5NCBJbREbrfwQ/edit#gid=1958287660

It seems over 50% of usages come from only 8 projects. Over 11% comes from scala/collection itself.

smarter commented 7 years ago

I see a count of 9 for dotty/tools but only one occurence of .view in the sources of dotty in compiler/src/dotty/tools, any idea why?

olafurpg commented 7 years ago

@smarter Here are the classfiles associated with of those view calls in dotty/tools: https://gist.github.com/olafurpg/fbc42ad1a139c386b9f28104b0c53a68

Here's an extension of result5.zip with two columns: 1. original classfile (omitted in result5.zip) and 2. scala/collection call-site collection6.zip

fommil commented 7 years ago

almost all uses of .mapValues I've ever seen has been an accidental use of views, does it show up at the binary level?

smarter commented 7 years ago

@smarter Here are the classfiles associated with of those view calls in dotty/tools: https://gist.github.com/olafurpg/fbc42ad1a139c386b9f28104b0c53a68

I looked into it and all but one of these occurences of view are not present in sources but occur because we extend some trait from the standard library, so I suspect the real usage of view is much smaller than what you would expect based on the stats you collected.

smarter commented 7 years ago

Maybe you could try filtering out all usages of view from *Like classes, since they're unlikely to be called by users?

DarkDimius commented 7 years ago

It might be good idea to filter out synthetic bridge methods

smarter commented 7 years ago

It might be good idea to filter out synthetic bridge methods

This seems to be already done, e.g. in Jar.class there's really 4 calls to view, the two that are in the gist provided by @olafurpg are in non-bridge methods:

 public java.lang.Object view();
    descriptor: ()Lscala/collection/IterableView;
    flags: ACC_PUBLIC
--
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokestatic  #243                // Method scala/collection/IterableLike$class.view:(Lscala/collection/IterableLike;)Lscala/collection/IterableView;
         4: areturn
      LocalVariableTable:
--
    Signature: #150                         // ()Ljava/lang/Object;

  public scala.collection.IterableView<java.util.jar.JarEntry, scala.collection.Iterable<java.util.jar.JarEntry>> view(int, int);
    descriptor: (II)Lscala/collection/IterableView;
    flags: ACC_PUBLIC
--
         1: iload_1
         2: iload_2
         3: invokestatic  #247                // Method scala/collection/IterableLike$class.view:(Lscala/collection/IterableLike;II)Lscala/collection/IterableView;
         6: areturn
      LocalVariableTable:
--
        line 36: 0

  public scala.collection.TraversableView view(int, int);
    descriptor: (II)Lscala/collection/TraversableView;
    flags: ACC_PUBLIC, ACC_BRIDGE, ACC_SYNTHETIC
--
         1: iload_1
         2: iload_2
         3: invokevirtual #804                // Method view:(II)Lscala/collection/IterableView;
         6: areturn
      LocalVariableTable:
--
        line 36: 0

  public scala.collection.TraversableView view();
    descriptor: ()Lscala/collection/TraversableView;
    flags: ACC_PUBLIC, ACC_BRIDGE, ACC_SYNTHETIC
--
      stack=1, locals=1, args_size=1
         0: aload_0
         1: invokevirtual #807                // Method view:()Lscala/collection/IterableView;
         4: areturn
      LocalVariableTable:

DarkDimius commented 7 years ago

Those are trait forwarders and apparently they don't get ACC_SYNTHETIC ACC_Bridge I'd propose to filter them by discarding calls using invokestatic that have aload_0 in previous 5 instructions. Additionally, those 5 previous instructions shound't contain any invoke-s.

olafurpg commented 7 years ago

@fommil I suspect not. At least, mapValues doesn't appear next to view calls in the corpus. Here's the breakdown for mapValues https://docs.google.com/spreadsheets/d/14jkp9wpVJXD2kPKZWM1vWbc1gt8HRE5NCBJbREbrfwQ/edit#gid=26040485

@smarter That is a good observation. There are 26 instances of *Like.class with .view calls, of which 24 are from scala/collection.

@DarkDimius the corpus doesn't contain the leading instructions. I guess asm is better suited than bash+javap for more advanced queries. Let me know if you want the corpus of jars, it's ~3.5gb.

DarkDimius commented 7 years ago

@olafurpg, if you'll be filtering out Like-s I'd only filter out "IterableLike$class", i.e. with "$class", other calls inside non-synthetic methods should be valild calls.

szeiger commented 6 years ago

Closing since we've merged the strawman into 2.13.x.

scala / collection-strawman

Define steps to reach a Go / No-Go decision on adopting this strawman #21