Closed gkossakowski closed 6 years ago
seems tricky because we often enable/disable subprojects and/or tests, so I don't think we'd get an accurate count anyway. I don't plan to tackle it
I've gotten interested in this again, partly because I would really like to have a number to put in a blog post
the most accurate approach would be to inject some instrumentation into the compiler. I guess this would be a little compiler plugin that we'd add to all the projects that would count the lines and print a line of output with a sum, then to get the total we'd grep for those lines in the overall run log.
(but maybe we could also get a good-enough count by writing a script that greps the run log for lines like
[akka-http] The following subprojects will be built in project akka-http: akka-parsing, akka-http-core, akka-http, akka-http-xml, akka-http-spray-json, akka-http-marshallers-scala, akka-http-testkit, akka-http-jackson, akka-http-marshallers-java, akka-http2-support, akka-http-tests, docs, root
and then refine it a bit by including test code or not by looking for extra.run-tests: false)
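A minimal sketch of the log-grepping idea, assuming the run log contains lines in exactly the format quoted above. The function name and the sample log are made up for illustration; it only extracts the subproject lists and does not look at extra.run-tests.

```python
import re

# Matches log lines of the form quoted above:
#   [project] The following subprojects will be built in project project: a, b, c
SUBPROJECTS = re.compile(
    r"^\[(?P<project>[^\]]+)\] The following subprojects will be built "
    r"in project (?P=project): (?P<subs>.+)$"
)


def subprojects_by_project(log_lines):
    """Map each project name to the subprojects the log says will be built."""
    result = {}
    for line in log_lines:
        m = SUBPROJECTS.match(line.strip())
        if m:
            subs = [s.strip() for s in m.group("subs").split(",")]
            result[m.group("project")] = subs
    return result


# Hypothetical, shortened log excerpt (not real output).
sample_log = [
    "[akka-http] The following subprojects will be built in project akka-http: "
    "akka-parsing, akka-http-core, docs, root",
    "[akka-http] some unrelated line",
]
built = subprojects_by_project(sample_log)
print(built)
```

Knowing which subprojects are built would still leave the actual counting to be done per subprojects' source directories, which is why this is only a "good-enough" route.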
I'm thinking I ought to just go the compiler plugin route, I know the small-compiler-plugin drill pretty well. I'll need to check and see what dbuild offers for doing the injection.
@SethTisue have you considered running cloc from the compiler plugin? You could easily access the set of sources the compiler is about to compile from the plugin and rely on cloc's implementation for skipping comments and counting code.
@gkossakowski that's a good idea. the plugin could just print the filenames, then we'd grep for those and pass them to cloc.
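A rough sketch of the "print the filenames, grep for them, pass them to cloc" variant. The "[cloc-input] " marker is a hypothetical prefix the plugin would print, not something an existing plugin emits; --include-lang and --list-file are existing cloc options.

```python
import tempfile

# Hypothetical prefix the plugin would print before each source path.
MARKER = "[cloc-input] "


def source_paths(log_lines):
    """Collect the unique source paths announced in the log, in first-seen order."""
    seen = {}
    for line in log_lines:
        if line.startswith(MARKER):
            seen.setdefault(line[len(MARKER):].strip(), None)
    return list(seen)


def cloc_command(paths):
    """Write the paths to a temporary list file and build a cloc command reading it."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(paths) + "\n")
    return ["cloc", "--include-lang=Scala", f"--list-file={f.name}"]


# Hypothetical log excerpt; the same file is announced twice.
sample_log = [
    "[cloc-input] /build/foo/src/main/scala/A.scala",
    "[info] compiling 2 Scala sources",
    "[cloc-input] /build/foo/src/main/scala/B.scala",
    "[cloc-input] /build/foo/src/main/scala/A.scala",
]
paths = source_paths(sample_log)
command = cloc_command(paths)
print(command)
```

Running the command (for example with subprocess.run(command, check=True)) requires cloc to be installed and the listed files to exist.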
I decided I liked Grzegorz's suggestion of running cloc directly from the plugin. I'm trying to get this done in the most expeditious manner possible and doing it this way means we don't need to clutter up the dbuild log with a lot of extra stuff.
here's the compiler plugin: https://github.com/sethtisue/cloc-plugin
working now on integrating it in this repo.
keeping the ticket open until we actually have a full count in hand. (we only count lines compiled during a particular run, so many runs will only have a partial count, if any cached builds are used.)
a recent run has 1.74 million total lines:
Lines of Scala code recompiled during this run only:
241654 scala-collections-laws
186789 akka-more
145424 scala-js
76366 scala-debugger
61218 monix
58929 scalatest
52611 akka-http
42804 breeze
41786 scala-refactoring
40499 scalaz
39930 twitter-util
39108 spire
36187 specs2
35664 scalikejdbc
32298 play-core
29479 shapeless
28017 zinc
27435 scalameta-2
27212 sbt
25517 slick
22113 cats
20639 scalameta-1
18214 akka-actor
17866 collection-strawman
16490 scalachess
14714 unfiltered
13693 sbt-librarymanagement
13668 ammonite
13259 scalariform
12733 scalapb
11153 scala-stm
9708 github4s
9545 scalaprops
9485 coursier
9303 play-json
8649 fs2
8489 sjson-new
8425 circe
8176 scalafmt
8045 scala-gopher
7985 scalastyle
6905 fastparse
6713 scalafix
6258 scala-java8-compat
5780 scala-swing
5770 parboiled2
5561 conductr-lib
5445 scalameter
5387 scalacheck
5259 argonaut
5187 scala-async
4888 scallop
4817 json4s
4560 jackson-module-scala
4519 kxbmap-configs
4376 doodle
4238 pureconfig
4222 lift-json
4030 meta-paradise
3953 play-ws
3834 monocle
3682 blaze
3600 ssl-config
3544 scodec-bits
3516 utest
3459 scalatags
3329 nyaya
3322 sbt-util
3269 scoverage
3236 scodec
2910 macro-paradise
2798 upickle
2783 algebra
2699 spray-json
2696 better-files
2516 scala-continuations
2508 gigahorse
2393 scalamock
2375 cachecontrol
2171 twirl
2143 jawn-0-11
2099 sbt-io
2075 pprint
2050 scala-parser-combinators
1988 sbinary
1946 scala-partest
1935 scalatex
1848 scopt
1833 mima
1799 scalacheck-shapeless
1798 jawn-0-10
1770 dispatch
1650 cats-effect
1519 scala-json-ast
1428 parboiled
1398 scala-records
1327 scalaj-http
1296 case-app
1246 paiges
1136 metaconfig-old
1072 genjavadoc
1064 lightbend-emoji
1033 akka-contrib-extra
1014 scala-logging
937 metaconfig-new
925 play-doc
909 simulacrum
886 fansi
884 twotails
870 atto
786 play-webgoat
759 minitest
731 pcplod
719 scala-xml-quote
641 acyclic
614 log4s
607 scala-ssh
571 macro-compat
538 geny
530 tut
497 sourcecode
492 circe-config
478 base64
475 kind-projector
400 http4s-websocket
281 scalapb-lenses
258 discipline
255 scalalib
211 machinist
146 jawn-fs2
111 semanticdb-sbt
107 sbt-testng
46 catalysts
1743918 TOTAL
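For anyone who wants to re-check or post-process a report like the one above, here is a small sketch that parses "count name" lines and compares their sum with the TOTAL line. The sample uses only the first three projects, and its TOTAL line is computed for that excerpt alone, not taken from the real report.

```python
def parse_report(text):
    """Return (per_project_counts, reported_total) from '<count> <name>' lines."""
    counts, reported_total = {}, None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header and anything not shaped like '<count> <name>'
        count, name = int(parts[0]), parts[1]
        if name == "TOTAL":
            reported_total = count
        else:
            counts[name] = count
    return counts, reported_total


# Excerpt of the report above; the TOTAL here covers these three lines only.
sample = """\
Lines of Scala code recompiled during this run only:
241654 scala-collections-laws
186789 akka-more
145424 scala-js
573867 TOTAL
"""
counts, reported_total = parse_report(sample)
computed_total = sum(counts.values())
print(computed_total, reported_total)
```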
The blog post at https://www.scala-lang.org/blog/2017/11/27/macros.html says that the community build is ~3 million LoC. Where does the difference come from?
the "corpus" that Olafur mentions in that blog post may be derived in part from the community build, but isn't the same and is apparently larger. for example, he lists scanamo and lila as being included, but neither of them is in the community build.
@olafurpg is your corpus on GitHub somewhere?
I just realized we may be counting js/jvm cross-built files twice, which may explain the difference. I will double check tomorrow! 😅
> I just realized we may be counting js/jvm cross-built files twice
you're not generating Scaladoc, are you? I was double-counting until I added an if (!global.settings.isScaladoc) check
I manually added a few more projects. I'm on the phone now but I can send a link to the corpus and steps to reproduce when I'm back at the computer.
I run compile for all projects that define either 2.11 or 2.12 in their cross Scala versions
fwiw, I consider it expected and normal that the community build would be smaller than other corpuses of open source Scala code. getting stuff in the community build is hard, for multiple reasons:
(you guys know these things, just stating them for the record)
OK, I re-ran the analysis, deduplicating jvm/js cross-built files and counting the occurrences of '\n' (the metric could definitely be refined), and the total appears to be ~2 million LoC instead of 3 million. Without jvm/js deduplication I can only count 2.3 million LoC, so I was doing something wrong when I first ran the analysis. I'll update the blog post to reflect this.
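To make the metric concrete, here is one possible sketch of "deduplicate, then count '\n'". It assumes that deduplication means skipping files whose bytes are identical to a file already counted; the analysis described above may well deduplicate differently (for example by path), so treat this as an illustration only.

```python
import hashlib


def count_newlines(files, dedup=True):
    """Sum newline characters over (path, content_bytes) pairs.

    With dedup=True, a file whose content is byte-identical to an
    earlier one (e.g. a jvm/js cross-built copy) is counted only once.
    """
    seen = set()
    total = 0
    for _path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if dedup and digest in seen:
            continue
        seen.add(digest)
        total += content.count(b"\n")
    return total


# Hypothetical inputs: the same source appears under jvm/ and js/.
shared = b"object A {\n  val x = 1\n}\n"
files = [
    ("jvm/src/main/scala/A.scala", shared),
    ("js/src/main/scala/A.scala", shared),
    ("jvm/src/main/scala/B.scala", b"object B\n"),
]
deduplicated = count_newlines(files)
raw = count_newlines(files, dedup=False)
print(deduplicated, raw)
```

Counting raw newlines includes blank lines and comments, which is one way the metric differs from what cloc reports.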
Two notable additions to my corpus are ornicar/lila (180k) and guardian/frontend (140k), which may help explain the difference with the compiler CB. I think the compiler CB also skips submodules in some projects.
Here is a full breakdown of the loc/project https://docs.google.com/spreadsheets/d/1btkCiF30Wb9MJti6LDc9og788XqXBKgwEhIdKb9aloc/edit?usp=sharing
Instructions to reproduce the analysis are in the readme here https://github.com/olafurpg/scala-experiments
Thanks for checking the numbers!
We are tracking the number of lines compiled by the community build by manually running cloc from time to time. It would be great to run cloc automatically. See also typesafehub/dbuild#133 and typesafehub/dbuild#142