scala / community-build

Scala 2 community build — a corpus of open-source repos built against Scala nightlies
Apache License 2.0

disk space regression on behemoth 2 #617

Closed SethTisue closed 7 years ago

SethTisue commented 7 years ago

I think something must have changed in the community build recently that is making the community builds chew up way more disk than they used to — as in, you can't even run the 2.12 and 2.13 builds once each without ending up out of disk.

needs investigation.

cunei commented 7 years ago

Note that the directories "~/.dbuild/cache*" collect all of the artifacts generated by dbuild, and are never garbage collected (it has been a "todo" in dbuild for a very long time). I don't know about the setup of the community build in particular, but you may want to check the size of those dirs every now and then, and zap them on occasion. That is especially true now that there are many projects and many different configurations involved.
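The periodic size check suggested above could be sketched as follows. This assumes the caches sit directly under `~/.dbuild` and match `cache*`, as described in the comment; the function names `dir_size` and `cache_sizes` are hypothetical and not part of dbuild.

```python
import os
from pathlib import Path


def dir_size(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def cache_sizes(dbuild_home):
    """Return (size_in_bytes, path) pairs for each cache* directory, largest first."""
    caches = [p for p in Path(dbuild_home).glob("cache*") if p.is_dir()]
    return sorted(((dir_size(p), str(p)) for p in caches), reverse=True)


if __name__ == "__main__":
    # Assumption: the dbuild caches live in ~/.dbuild, as the comment above says.
    for size, path in cache_sizes(Path.home() / ".dbuild"):
        print(f"{size / 2**30:8.2f} GiB  {path}")
```

Run from cron or by hand, this only reports sizes; deciding when to zap a cache is left to the operator.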

retronym commented 7 years ago

The main culprit appears to be scala-*-integrate-community-build/target-0.9.9/project-builds.

--- /home/jenkins/workspace/scala-2.12.x-integrate-community-build/target-0.9.9/project-builds ---------------------------------------------------------------
                         /..
    2.9 GiB [##########] /fastparse-7e54355938440ef5e40886af071772172a7b7526
    1.4 GiB [####      ] /akka-more-1b1a0f6ddda7da64b652991d4d18cbf7f2b50329
    1.2 GiB [####      ] /play-core-1510118be95aeaebc40d3639b662bc5b674bf0d2
    1.0 GiB [###       ] /scalaz-76bd5764c6f8dd93666bd00579afd389b477dcdb
  907.9 MiB [###       ] /scalameta-3a5b419089b29bd8ba097282114548ce24784aaa
  767.4 MiB [##        ] /breeze-b0b808c2bca03de856dc31b3f75e1512cfa7a704
  742.7 MiB [##        ] /specs2-b035e90f082aa56ca8ba4380c8875e9ad3fc89df
  734.0 MiB [##        ] /akka-http-962f54885423d7e99115596dcea79518b1ba2fa8
  705.4 MiB [##        ] /akka-actor-1c695d12af79f1d3ac751aaa911b6f00051b5acb
  663.3 MiB [##        ] /unfiltered-dac954c1724d5856447807f11b63b3bbd621a089
  634.0 MiB [##        ] /scalikejdbc-0aaf0abd357f6f4cdfba4540f9c1c7cf8810125b
  633.5 MiB [##        ] /scalafix-cb58a33be9bae52783ee291c16dbadb8b967e6fa
  631.0 MiB [##        ] /scala-js-020d304f495e2f9a2ba3734f4f384ef8d469237d
  623.1 MiB [##        ] /monix-d99c847c5ade24c91e45c41239478e9b93c84e69
  607.3 MiB [##        ] /cats-84a80371921714c958a0d99bf2c963156f8702de
  601.8 MiB [##        ] /spire-87d759aa7fd265fb69c2c05dc38633229273cf91
  595.1 MiB [##        ] /scalatest-a48b2221995e91deb0ce628b653f636caec71266
  549.7 MiB [#         ] /sbt-librarymanagement-fb47e094ec8efb708200d55a5156846c04df8d97
  536.0 MiB [#         ] /play-webgoat-a11f1896e96c249eafe2d0e706fb105443af9c58
  505.8 MiB [#         ] /conductr-lib-bd61d089542d9844695c80737cd873743bedd2cb
  480.6 MiB [#         ] /twitter-util-f191b661d362603b251f2a55663d36815ee0be2f
  479.7 MiB [#         ] /play-ws-a4560867b8e0627d0cc6b09510c953876ef100fb

I've proposed a change to fastparse to reduce the space used by its tests. It checks out a bunch of open source git repos as corpora to test its parsers, but neglects to do a shallow clone in one place. We could disable its tests in the meantime.
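For readers unfamiliar with shallow clones: `git clone --depth 1` fetches only the most recent commit instead of the full history, which is what keeps a test corpus small on disk. The sketch below is an illustration only, not fastparse's actual test code; the helper names are hypothetical.

```python
import subprocess


def shallow_clone_command(url, dest, depth=1):
    """Build a git command that fetches only the most recent `depth` commits."""
    return ["git", "clone", "--depth", str(depth), url, dest]


def shallow_clone(url, dest, depth=1):
    """Run the shallow clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(shallow_clone_command(url, dest, depth), check=True)
```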

Could/should we just run clean as an extra command for each build so that we only need one populated target directory at a time?

What is stored in project-builds/**/.dbuild/{local-repo,topIvy}? These also seem to be space hogs. Could they be deleted after each project build without costing too much on a subsequent run of the community build?

retronym commented 7 years ago

Here's a snapshot of the disk usage generated by, and viewable with, ncdu.

SethTisue commented 7 years ago

TIL ncdu, slick! I'll switch to that from the du -ka . | sort -nr shell alias I've been using for 25 years

SethTisue commented 7 years ago

in 1f2859bf70e71ecfb453dc3035ed4dc0e39dc10f I temporarily switched the community build to use @retronym's branch of fastparse (green run: https://scala-ci.typesafe.com/job/scala-2.12.x-integrate-community-build/2075/consoleFull). hopefully that PR will be merged and we can unfork again

leaving the ticket open for now.

SethTisue commented 7 years ago

> Could/should we just run clean as an extra command for each build so that we only need one populated target directory at a time

that is definitely worth considering.

traditionally we have deleted that stuff at the start of each run, rather than the end, in case we need the files in order to do postmortems on failures

in practice, I'd say I've used that capability only a handful of times over the past two years. if it were ever really needed we could do a new run on a branch where the cleanup command is removed/commented out. anyway, most problems are reproducible by running the build locally, which is a more convenient location for forensics & autopsies.

retronym commented 7 years ago

Is there any performance argument for leaving results of the previous community build in place? I seem to recall that project builds are somewhat incremental, but if we are changing the compiler each time there seems little prospect for avoiding rebuilds.

@cunei @SethTisue taking the idea a bit further, how about a mode in dbuild itself to clean up each project's directory (remove any .dbuild and target/** that aren't directly required by downstream builds) at the conclusion of each project's build? The goal would be to reduce the amount of disk needed to run the community build from the current ~40GB to something more like 10GB.
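A rough sketch of what such a per-project cleanup step might look like, written as a standalone script rather than as a dbuild feature. Everything here is an assumption: `clean_project` is a hypothetical helper, and which directories downstream builds still need is not established in this thread, so the `keep` argument is left for the caller to fill in.

```python
import shutil
from pathlib import Path


def clean_project(project_dir, names=("target", ".dbuild"), keep=()):
    """Delete every directory under project_dir whose name is in names.

    keep lists directories (relative to project_dir) that must survive, for
    example outputs that downstream builds still read. A kept directory is
    only protected if it is not nested inside another removed directory.
    Returns the list of removed paths.
    """
    root = Path(project_dir)
    kept = {root / k for k in keep}
    removed = []
    for name in names:
        # sorted() materialises the matches before anything is deleted.
        for path in sorted(root.rglob(name)):
            # Skip symlinks, kept paths, and paths already removed with a parent.
            if path.is_symlink() or not path.is_dir() or path in kept:
                continue
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling `clean_project(project_dir)` right after a project finishes would leave sources in place and drop only the build output, so at most one project has a populated `target` at any time.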

SethTisue commented 7 years ago

I had left the target directory in place in order to facilitate postmortems, but in practice, I've rarely or never used that capability. if I want to do a postmortem, I usually try to reproduce the problem locally where it's more convenient to work with, then go from there. I've rarely or never needed to actually do the postmortem on the behemoth itself.

so I'd be fine with blowing away the target directory in the workspace at the end of the run rather than the beginning. (we'll want to be sure it gets blown away regardless of whether the run succeeded or failed, I think.)
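The "regardless of whether the run succeeded or failed" requirement maps naturally onto a `try`/`finally` wrapper. The following is a minimal sketch under that assumption, not the community build's actual job script; `run_then_clean` and its arguments are hypothetical.

```python
import shutil
import subprocess


def run_then_clean(command, target_dir):
    """Run command, then delete target_dir whether the run passed or failed.

    Returns the command's exit code.
    """
    try:
        return subprocess.run(command).returncode
    finally:
        # Runs after a zero exit, a non-zero exit, and an exception alike.
        shutil.rmtree(target_dir, ignore_errors=True)
```

In a Jenkins job the same effect could come from a post-build step that always runs; the wrapper form just keeps the guarantee next to the command it protects.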

removing ~/.dbuild is probably a no-go since it's shared by multiple jobs.

SethTisue commented 7 years ago

note that there is existing code to delete target, it just currently happens at the beginning of a run, not at the end

retronym commented 7 years ago

> removing ~/.dbuild is probably a no-go since it's shared by multiple jobs.

Just to clarify, there are a few different folders named .dbuild. I was hoping to purge projectA/**/.dbuild/** eagerly, right after dbuild builds that project. But I don't have a model for what parts (if any) of those directories are "outputs" and required for downstream projects.

SethTisue commented 7 years ago

this hasn't been a problem lately, optimistically closing