ocurrent / current-bench

Experimental benchmarking infrastructure using OCurrent pipelines
Apache License 2.0

Be aggressive about removing all intermediate containers #478

Closed by punchagan 7 months ago

punchagan commented 10 months ago

Autumn has been running low on disk space and running out of inodes. It's not clear what is causing this, since docker system prune doesn't seem to help. We already pass --rm to docker run. This commit makes the build also remove intermediate containers when a build fails.

art-w commented 10 months ago

Interesting :/ I poked around and ran more docker pruning (which didn't help much), but there's something I don't understand with the docker overlay2 directories:

$ df --inodes
Filesystem                  Inodes    IUsed     IFree IUse% Mounted on
/dev/mapper/vg_boot-data  13033472 11697303   1336169   90% /data
                                   ^^^^^^^^

$ ls /data/docker/overlay2 | wc -l
376

It looks like the overlay2 directories contain many copies of the opam-repository, which together account for about 6M of the 11M inodes in use:

$ cat <the-following> | awk '1{s+=$1} END{print s}'
6262728

$ find . -name 'opam-repository' | grep -v 'packages/opam-repository' | xargs -n 1 du --inodes --summarize
158132  ./8d9c7622c5da83b9fc97eb5b35d15c1a091c67f6ac510cb420d2af8723298e9a/diff/home/opam/opam-repository
163196  ./e1d20c07bfee4daede15d45acb60099946cd5d081454119349e4636d9676b2ad/diff/home/opam/opam-repository
158979  ./9ae3f8364f3224c219b23991db042949d22dbb99c635fffce4fd344a4971db81/diff/home/opam/opam-repository
162945  ./58f68ebcbcda9dc5c70fe24939517b5cefcc47f61469c3223269de29a11d00db/diff/home/opam/opam-repository
160626  ./65c845f4119446ba159aecb67f1e8d5e58384198bb8bc9a2c07abfd7b0d861c9/diff/home/opam/opam-repository
160237  ./ebd0d0da0a46d151489345565a9a637b4fc68414491e2d694f2b46108c1f9f9b/diff/home/opam/opam-repository
159933  ./526c93367c739f7dbfd63b445193442317ceb836ddbc209b9b83277c29f8b2aa/diff/home/opam/opam-repository
162455  ./8f3de902206ac0203291b6125079e5547b88eeac29ebb306676ec0b6f07a4001/diff/home/opam/opam-repository
161983  ./63f97ea97e5efc15af3472fa82df599d44da3573e264f2cafb144e0b278eb657/diff/home/opam/opam-repository
163196  ./4b7ff32e31e4670ba225d74d9c0c1fe19853b5d78a6ce15aa6267e040537673c/diff/home/opam/opam-repository
162945  ./fbd1f3b5ad1337a73a0a1fe773246ffe8663d481ffa9be4f6e9dcf067d402f66/diff/home/opam/opam-repository
160092  ./13b185ec4cb9d780a7c25051bbeb5da9392c553a68853e29579beaaf1cc4e7a7/diff/home/opam/opam-repository
160940  ./1a5c890c385ee1e6be5585b76decf4298969912c920b188cb02834d4048b323c/diff/home/opam/opam-repository
158725  ./3321809141f44bb75abb43f079c09cc2923bc07b20c1a621704515efe2138381/diff/home/opam/opam-repository
160235  ./62d8d3fa1f05c55689005b0bb5e5719c2f618c0d62dec3b862693b54f03a8558/diff/home/opam/opam-repository
7   ./tjoeyzh1257fnrjmb0b8sn990/diff/src/opam-repository
157643  ./371480b9ed57b53bdc8495f258cbb70fde20882676f9bb7933616063b01f0457/diff/home/opam/opam-repository
161763  ./5f050a5df47f336243e15ad60a1d1dc91cd2a1ab2a3ca97b77d5a7cabcd7a9ec/diff/home/opam/opam-repository
157116  ./b85ea590205f118781e1bbb448fa279f94dd03653f4f8f694ca09c304ecf8002/diff/home/opam/opam-repository
158521  ./457e26059d439457efc10332061c8f1d04abdbf3edac884580383e1502c070dc/diff/home/opam/opam-repository
161983  ./2ef48424c7374673593f919ab443319d089814b9cf0c99bdad69de562e4976b1/diff/home/opam/opam-repository
158254  ./97d6e62ca51a0b215ec23cd72e9984b832bfc1eff20b48622d365ee5b072c007/diff/home/opam/opam-repository
3702    ./lwxghksbh8d1skc4tmx5pp08i/diff/src/opam-repository
160098  ./2cc40f56e5f0d85085722fbebe3680144906f36bfa7009cec3ead1e17dc453de/diff/home/opam/opam-repository
158621  ./e9d9cfda2f7bc577d68de592189ddd016a5d2d42df891261a494f2b6798d7b77/diff/home/opam/opam-repository
156695  ./644cf526c2351a0f456b96bca35fb5851225d309fc4a4886e2696c942b1f09be/diff/home/opam/opam-repository
156221  ./6f968e34071e3ac4ce1c0c6a85f3b73bf749ef201cd59dd580c36945af485584/diff/home/opam/opam-repository
2   ./uiioy4lwo0msm688dkibvjs2r/diff/home/opam/.opam/5.0/.opam-switch/overlay/opam-repository
11  ./uiioy4lwo0msm688dkibvjs2r/diff/home/opam/.opam/5.0/.opam-switch/sources/opam-repository
162455  ./3c481b00d911fd12c6b182db597fca6b5669cc80cdb47a95da2de9314ba36c50/diff/home/opam/opam-repository
162202  ./ead3220a0dff5d2c7b49a33f057270227685c0efa6c9b0634b0f4f71337a7b91/diff/home/opam/opam-repository
158340  ./97a4973997c62f0d7bee6ed9e2a98cd099eb7d380324551e6ab751333e64f2d1/diff/home/opam/opam-repository
156337  ./54abd607b2242d36cd933221e447808b94bdbe4e4220a62e85aec821fe84ccc3/diff/home/opam/opam-repository
157272  ./3a471f284ba5275ffc2ccf2bee333e79785dec2adad72354a99aec636f0189d8/diff/home/opam/opam-repository
157388  ./18df9b7873e1bd86a22140b6ce25911a4b4d87f4633e08d5e7511e90cc78377c/diff/home/opam/opam-repository
158586  ./3b4a554a350318c47ca6b570570548871ff50a4cc912a8fc64bb93441bd5afe7/diff/home/opam/opam-repository
158240  ./6d8611846428aee1e7762ddf9823b8e7b11e3b1707fc383bc7b398a924d00428/diff/home/opam/opam-repository
155238  ./a21b015b4ea4ad8b48ffa54ffafc8e1dc5cc1f5374ac4268f5e52d46a084f764/diff/home/opam/opam-repository
159297  ./401dfb26517d289c1787d33d47436b4e52540148ff7bdb755a8a27f65606ebe1/diff/home/opam/opam-repository
160909  ./b96cf2942ca8a5de6e54bfc3b86fdb8bef6cca09c35a74785dcd8066ef14d260/diff/home/opam/opam-repository
158032  ./97cbe2e7cac83cfec1f11fd6308721e3d68b280aa03183fcf4c087d5ae9f60cf/diff/home/opam/opam-repository
161094  ./a3a0aa4a7b9b559d830a10fc628b1893bc5ac143ddbefdf873139921d69aa49a/diff/home/opam/opam-repository
156082  ./bbbd77d846a0c82a38467bf893e9c85af641e12acd31ecd64338a3c5dacd1105/diff/home/opam/opam-repository
36000   ./w6ewat90g5etzyw8zpa7w0t1y/diff/src/opam-repository
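The two commands above can be chained so the total comes out in one step. This is a minimal sketch: the function name sum_first_column is made up for illustration, and the usage assumes GNU du (for --inodes) and the /data/docker/overlay2 path shown earlier.

```shell
# Sum the first whitespace-separated column of stdin,
# e.g. the inode counts printed by `du --inodes --summarize`.
# The function name is hypothetical, not part of any tool.
sum_first_column() {
  # `s + 0` forces a numeric 0 when the input is empty.
  awk '{ s += $1 } END { print s + 0 }'
}

# Usage sketch (as root, on the Docker host; path taken from the thread):
#   cd /data/docker/overlay2
#   find . -name 'opam-repository' | grep -v 'packages/opam-repository' \
#     | xargs -n 1 du --inodes --summarize | sum_first_column
```

Keeping the summing step as a separate filter means the same function can total any du listing, not only the opam-repository one.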

Why not, but some of those directories are a bit old:

drwx------ 3 root root 4.0K Mar 26 08:52 a21b015b4ea4ad8b48ffa54ffafc8e1dc5cc1f5374ac4268f5e52d46a084f764
drwx------ 3 root root 4.0K Mar 27 17:32 bbbd77d846a0c82a38467bf893e9c85af641e12acd31ecd64338a3c5dacd1105
drwx------ 3 root root 4.0K Apr  3 23:25 54abd607b2242d36cd933221e447808b94bdbe4e4220a62e85aec821fe84ccc3
drwx------ 3 root root 4.0K Apr  6 10:24 6f968e34071e3ac4ce1c0c6a85f3b73bf749ef201cd59dd580c36945af485584
...

And I can't find why docker retains them after all of our system pruning:

$ docker inspect $(docker ps --all -q) | grep a21b0
$ docker image inspect $(docker image ls --all -q) | grep a21b0

Any idea? I don't want to remove them manually as I don't understand why they are there :P Perhaps we could stop all of our dockers, run the system prune, and see if anything remains?
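The single-directory grep above can be extended to every overlay2 directory at once. The sketch below is one possible way to do it, not what was actually run: unreferenced_dirs and /tmp/inspect.json are hypothetical names, and a directory missing from the dump is only a candidate for investigation, since build cache is not covered by these two inspect commands.

```shell
# Read directory names on stdin (one per line) and print those that do
# not appear anywhere in the inspect dump passed as $1.
# The function name is hypothetical.
unreferenced_dirs() {
  dump="$1"
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    # -F: match the name as a fixed string; -q: only the exit status matters.
    grep -qF -- "$name" "$dump" || printf '%s\n' "$name"
  done
}

# Usage sketch (as root; assumes at least one container and one image exist):
#   { docker inspect $(docker ps --all -q)
#     docker image inspect $(docker image ls --all -q); } > /tmp/inspect.json
#   ls /data/docker/overlay2 | unreferenced_dirs /tmp/inspect.json
```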

punchagan commented 10 months ago

> Any idea? I don't want to remove them manually as I don't understand why they are there :P Perhaps we could stop all of our dockers, run the system prune, and see if anything remains?

It may be worth trying, but the service will be down for a while, until all the current-bench containers have been rebuilt.

@mtelvers and I tried to remove all of the overlay2 directories that seemed unused, but that caused build failures. Later, though, I moved out a handful of them, picked by date, and that freed enough space to redeploy the production instance. There's a thread on #ci-dev that I'll DM you on Slack, which might add more information.

mtelvers commented 10 months ago

I have set up my own current-bench to try to understand what is happening. Building the system creates around 18GB of Docker images. Then, for each run of the test project, three directories seem to be left over in the docker/overlay2 folder. Of interest, I note that docker system prune -af removed them.
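One way to reproduce this kind of observation is to snapshot the overlay2 listing before and after a run and print what was added. This is a sketch only: new_entries and the /tmp file names are hypothetical, and the overlay2 path is the one from the production host, which may differ on a test installation.

```shell
# Print lines that are present in the "after" listing ($2) but not in
# the "before" listing ($1). The function name is hypothetical.
new_entries() {
  before_sorted=$(mktemp)
  after_sorted=$(mktemp)
  # comm needs both inputs sorted with the same collation.
  LC_ALL=C sort "$1" > "$before_sorted"
  LC_ALL=C sort "$2" > "$after_sorted"
  # -1 hides lines only in the first file, -3 hides lines common to both.
  LC_ALL=C comm -13 "$before_sorted" "$after_sorted"
  rm -f "$before_sorted" "$after_sorted"
}

# Usage sketch (as root):
#   ls /data/docker/overlay2 > /tmp/before.txt
#   ... trigger one run of the test project ...
#   ls /data/docker/overlay2 > /tmp/after.txt
#   new_entries /tmp/before.txt /tmp/after.txt
```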

mtelvers commented 10 months ago

Having experimented some more on my installation, I propose the following course of action:

1) Stop Docker, edit /etc/docker/daemon.json, and set a new Docker root on the spare spinning disk in the machine
2) Deploy an empty copy of current-bench in the new Docker installation
3) Copy the Docker volumes over from the original installation (thereby taking a copy of the previous job logs and, importantly, the database)

If this proves successful, we can move this installation on to the SSD. If it fails, we revert the change in daemon.json and we are back where we are now.
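The first step of the plan might look roughly like the sketch below. It assumes the daemon configuration lives at the stock location /etc/docker/daemon.json, that the new root is set through the "data-root" key, and that the host uses systemd. The function name and the /mnt/spare/docker path are hypothetical.

```shell
# Print a minimal daemon.json that points Docker at a new data root.
# $1 is the new root directory; it must not contain quotes or
# backslashes, since no JSON escaping is done here.
# The function name is hypothetical.
daemon_json_for_root() {
  printf '{\n  "data-root": "%s"\n}\n' "$1"
}

# Usage sketch (as root; /mnt/spare/docker is a made-up path):
#   systemctl stop docker
#   daemon_json_for_root /mnt/spare/docker   # review, then merge into /etc/docker/daemon.json
#   systemctl start docker
#   docker info --format '{{ .DockerRootDir }}'   # check which root is in use
```

Printing the fragment instead of writing the file directly leaves any existing settings in daemon.json untouched until they have been merged by hand, and going back is a matter of removing the key and restarting the daemon.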

mtelvers commented 10 months ago

See also https://github.com/ocaml/infrastructure/issues/70

punchagan commented 10 months ago

> Having experimented some more on my installation, I propose the following course of action:

I've followed the steps you proposed, and the Docker service is now using the spinning disk. If everything works fine overnight, we can plan on moving things to the SSD.

punchagan commented 7 months ago

The docker service now uses the SSD.