scala / community-build

Scala 2 community build — a corpus of open-source repos built against Scala nightlies
Apache License 2.0

Track automatically how many lines of code we are compiling #59

Closed gkossakowski closed 6 years ago

gkossakowski commented 10 years ago

We are tracking the number of lines compiled by the community build by manually running cloc from time to time. It would be great to run cloc automatically.

See also typesafehub/dbuild#133 and typesafehub/dbuild#142

SethTisue commented 8 years ago

seems tricky because we often enable/disable subprojects and/or tests, so I don't think we'd get an accurate count anyway. I don't plan to tackle it

SethTisue commented 7 years ago

I've gotten interested in this again, partly because I would really like to have a number to put in a blog post

the most accurate approach would be to inject some instrumentation into the compiler. I guess this would be a little compiler plugin that we'd add to all the projects that would count the lines and print a line of output with a sum, then to get the total we'd grep for those lines in the overall run log.
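A minimal sketch of the grep-and-sum step described above. It assumes a hypothetical plugin that prints one line per compiler invocation in the form `[project] LOC-COUNT: <n>`; the marker text, the log layout, and the sample numbers are all made up for illustration and are not what any real plugin prints.

```python
import re

# Hypothetical marker format "[<project>] LOC-COUNT: <n>" (an assumption, not a real plugin's output).
MARKER = re.compile(r"^\[(?P<project>[^\]]+)\]\s+LOC-COUNT:\s+(?P<count>\d+)\s*$")


def total_lines(log_text):
    """Sum the counts from every marker line found in the run log."""
    total = 0
    for line in log_text.splitlines():
        match = MARKER.match(line)
        if match:
            total += int(match.group("count"))
    return total


# Made-up log excerpt; lines without the marker are ignored.
sample_log = """\
[project-a] compiling sources
[project-a] LOC-COUNT: 100
[project-b] LOC-COUNT: 250
[project-b] LOC-COUNT: 7
"""

print(total_lines(sample_log))  # prints 357
```

A project can be compiled in several invocations (one per subproject, for example), so the sketch simply adds up every marker line rather than expecting one per project.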

(but maybe we could also get a good-enough count by writing a script that greps the run log for lines like [akka-http] The following subprojects will be built in project akka-http: akka-parsing, akka-http-core, akka-http, akka-http-xml, akka-http-spray-json, akka-http-marshallers-scala, akka-http-testkit, akka-http-jackson, akka-http-marshallers-java, akka-http2-support, akka-http-tests, docs, root. and then refine it a bit by including test code or not by looking for false)
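The log-scraping alternative could start from something like the sketch below, which pulls the subproject names out of lines shaped like the akka-http example quoted above. The function name and the shortened sample line are illustrative assumptions; mapping subprojects to source directories and counting their lines would still be a separate step.

```python
import re

# Matches lines shaped like the example quoted in the comment above.
SUBPROJECTS = re.compile(
    r"^\[(?P<tag>[^\]]+)\] The following subprojects will be built in project "
    r"(?P<project>[^:]+): (?P<names>.+?)\.?\s*$"
)


def built_subprojects(log_text):
    """Map each project name to the list of subprojects the log says will be built."""
    result = {}
    for line in log_text.splitlines():
        match = SUBPROJECTS.match(line)
        if match:
            names = [n.strip() for n in match.group("names").split(",") if n.strip()]
            result[match.group("project")] = names
    return result


# Shortened version of the akka-http line, for illustration only.
sample_log = (
    "[akka-http] The following subprojects will be built in project akka-http: "
    "akka-parsing, akka-http-core, akka-http.\n"
)

print(built_subprojects(sample_log))
```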

I'm thinking I ought to just go the compiler plugin route, I know the small-compiler-plugin drill pretty well. I'll need to check and see what dbuild offers for doing the injection.

gkossakowski commented 7 years ago

@SethTisue have you considered running cloc from the compiler plugin? You could easily access the set of sources the compiler is about to compile from the plugin and rely on cloc's impl for skipping comments and counting code?

SethTisue commented 7 years ago

@gkossakowski that's a good idea. the plugin could just print the filenames, then we'd grep for those and pass them to cloc.
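A rough sketch of that print-then-grep variant, assuming the plugin emitted a hypothetical `LOC-SOURCE: <path>` line for each source file (the marker and the sample paths are invented for illustration). The collected paths are written to a list file and handed to cloc via its `--list-file` option; `run_cloc` needs cloc on the PATH and is not called here.

```python
import os
import re
import subprocess
import tempfile

# Hypothetical marker "LOC-SOURCE: <path>" (an assumption for illustration).
SOURCE_MARKER = re.compile(r"LOC-SOURCE:\s+(?P<path>\S.*?)\s*$")


def source_paths(log_text):
    """Collect the distinct source paths named in the log, in first-seen order."""
    seen = {}
    for line in log_text.splitlines():
        match = SOURCE_MARKER.search(line)
        if match:
            seen.setdefault(match.group("path"), None)
    return list(seen)


def run_cloc(paths):
    """Write the paths to a list file and return cloc's JSON report as text."""
    with tempfile.TemporaryDirectory() as tmp:
        list_file = os.path.join(tmp, "sources.txt")
        with open(list_file, "w") as handle:
            handle.write("\n".join(paths) + "\n")
        completed = subprocess.run(
            ["cloc", "--quiet", "--json", f"--list-file={list_file}"],
            capture_output=True,
            text=True,
            check=True,
        )
    return completed.stdout


# Made-up log excerpt; the repeated path is only reported once.
sample_log = """\
[project-a] LOC-SOURCE: /work/project-a/src/main/scala/Foo.scala
[project-a] LOC-SOURCE: /work/project-a/src/main/scala/Bar.scala
[project-a] LOC-SOURCE: /work/project-a/src/main/scala/Foo.scala
"""

print(source_paths(sample_log))
```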

SethTisue commented 6 years ago

I decided I liked Grzegorz's suggestion of running cloc directly from the plugin. I'm trying to get this done in the most expeditious manner possible and doing it this way means we don't need to clutter up the dbuild log with a lot of extra stuff.

here's the compiler plugin:

working now on integrating it in this repo.

SethTisue commented 6 years ago

keeping the ticket open until we actually have a full count in hand. (we only count lines compiled during a particular run, so many runs will only have a partial count, if any cached builds are used.)

SethTisue commented 6 years ago

a recent run has 1.74 million total lines:

Lines of Scala code recompiled during this run only:
   241654 scala-collections-laws
   186789 akka-more
   145424 scala-js
    76366 scala-debugger
    61218 monix
    58929 scalatest
    52611 akka-http
    42804 breeze
    41786 scala-refactoring
    40499 scalaz
    39930 twitter-util
    39108 spire
    36187 specs2
    35664 scalikejdbc
    32298 play-core
    29479 shapeless
    28017 zinc
    27435 scalameta-2
    27212 sbt
    25517 slick
    22113 cats
    20639 scalameta-1
    18214 akka-actor
    17866 collection-strawman
    16490 scalachess
    14714 unfiltered
    13693 sbt-librarymanagement
    13668 ammonite
    13259 scalariform
    12733 scalapb
    11153 scala-stm
     9708 github4s
     9545 scalaprops
     9485 coursier
     9303 play-json
     8649 fs2
     8489 sjson-new
     8425 circe
     8176 scalafmt
     8045 scala-gopher
     7985 scalastyle
     6905 fastparse
     6713 scalafix
     6258 scala-java8-compat
     5780 scala-swing
     5770 parboiled2
     5561 conductr-lib
     5445 scalameter
     5387 scalacheck
     5259 argonaut
     5187 scala-async
     4888 scallop
     4817 json4s
     4560 jackson-module-scala
     4519 kxbmap-configs
     4376 doodle
     4238 pureconfig
     4222 lift-json
     4030 meta-paradise
     3953 play-ws
     3834 monocle
     3682 blaze
     3600 ssl-config
     3544 scodec-bits
     3516 utest
     3459 scalatags
     3329 nyaya
     3322 sbt-util
     3269 scoverage
     3236 scodec
     2910 macro-paradise
     2798 upickle
     2783 algebra
     2699 spray-json
     2696 better-files
     2516 scala-continuations
     2508 gigahorse
     2393 scalamock
     2375 cachecontrol
     2171 twirl
     2143 jawn-0-11
     2099 sbt-io
     2075 pprint
     2050 scala-parser-combinators
     1988 sbinary
     1946 scala-partest
     1935 scalatex
     1848 scopt
     1833 mima
     1799 scalacheck-shapeless
     1798 jawn-0-10
     1770 dispatch
     1650 cats-effect
     1519 scala-json-ast
     1428 parboiled
     1398 scala-records
     1327 scalaj-http
     1296 case-app
     1246 paiges
     1136 metaconfig-old
     1072 genjavadoc
     1064 lightbend-emoji
     1033 akka-contrib-extra
     1014 scala-logging
      937 metaconfig-new
      925 play-doc
      909 simulacrum
      886 fansi
      884 twotails
      870 atto
      786 play-webgoat
      759 minitest
      731 pcplod
      719 scala-xml-quote
      641 acyclic
      614 log4s
      607 scala-ssh
      571 macro-compat
      538 geny
      530 tut
      497 sourcecode
      492 circe-config
      478 base64
      475 kind-projector
      400 http4s-websocket
      281 scalapb-lenses
      258 discipline
      255 scalalib
      211 machinist
      146 jawn-fs2
      111 semanticdb-sbt
      107 sbt-testng
       46 catalysts
  1743918 TOTAL
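
For reference, a report laid out like the one above (counts right-aligned, largest first, with a TOTAL row) could be produced from per-project counts along these lines. This is a sketch under assumptions, not the script that generated the listing; the function name is hypothetical and the sample uses only the last three rows of the table.

```python
def format_report(counts):
    """Render per-project counts, largest first, followed by a TOTAL row."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    lines = [f"{count:9d} {name}" for name, count in ordered]
    lines.append(f"{sum(counts.values()):9d} TOTAL")
    return "\n".join(lines)


# Three rows taken from the listing above.
sample_counts = {"catalysts": 46, "semanticdb-sbt": 111, "sbt-testng": 107}

print(format_report(sample_counts))
```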

gkossakowski commented 6 years ago

The blog post says that the community build is ~3 million LoC. Where does the difference come from?

SethTisue commented 6 years ago

the "corpus" that Olafur mentions in that blog post may be derived in part from the community build, but isn't the same and is apparently larger. for example, he lists scanamo and lila as being included, but neither of them is in the community build.

@olafurpg is your corpus on GitHub somewhere?

olafurpg commented 6 years ago

I just realized we may be counting js/jvm cross-built files twice, which may explain the difference. I will double check tomorrow! 😅

SethTisue commented 6 years ago

> I just realized we may be counting js/jvm cross-built files twice

you're not generating Scaladoc, are you? I was double-counting until I added an if (!global.settings.isScaladoc) check

olafurpg commented 6 years ago

I manually added a few more projects. I'm on the phone now but I can send a link to the corpus and steps to reproduce when I'm back at the computer.

olafurpg commented 6 years ago

I run compile for all projects that either define 2.11 or 2.12 in their cross Scala version

SethTisue commented 6 years ago

fwiw, I consider it expected and normal that the community build would be smaller than other corpuses of open source Scala code. getting stuff in the community build is hard, for multiple reasons:

(you guys know these things, just stating them for the record)

olafurpg commented 6 years ago

OK, I re-ran the analysis, deduplicating jvm/js cross-built files and counting the occurrences of '\n' (the metric could definitely be refined), and the total appears to be ~2 million loc instead of 3 million. Without jvm/js deduplication I can only count 2.3 million loc, so I was doing something wrong when I first ran the analysis. I'll update the blog post to reflect this.
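To illustrate the kind of deduplicated count described above, here is a minimal sketch that counts '\n' occurrences once per distinct file content. Deduplicating by content hash is an assumption made for this example; the comment does not say how the actual analysis identified cross-built duplicates, and the paths and contents below are invented.

```python
import hashlib


def count_newlines_deduplicated(files):
    """Count newline bytes across files, counting byte-identical contents only once.

    `files` maps a path to that file's raw bytes.
    """
    seen = set()
    total = 0
    for path in sorted(files):
        content = files[path]
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # same content already counted under another path
        seen.add(digest)
        total += content.count(b"\n")
    return total


# Invented example: Foo.scala is shared between the jvm and js builds.
sample_files = {
    "jvm/src/main/scala/Foo.scala": b"object Foo {\n}\n",
    "js/src/main/scala/Foo.scala": b"object Foo {\n}\n",
    "jvm/src/main/scala/Bar.scala": b"object Bar\n",
}

print(count_newlines_deduplicated(sample_files))  # prints 3
```

Without the hash check the same input would count 5 newlines, which is the sort of inflation that double-counting cross-built sources produces.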

Two notable additions to my corpus are ornicar/lila (180k) and guardian/frontend (140k), which may help explain the difference with the compiler CB. I think the compiler CB also skips submodules in some projects.

Here is a full breakdown of the loc/project

Instructions to reproduce the analysis are in the readme here

gkossakowski commented 6 years ago

Thanks for checking the numbers!