openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Add toggle to disable `results/backup/` files in AWS mode #586

Open Innixma opened 10 months ago

Innixma commented 10 months ago

I am running large-scale benchmarks in AWS mode and the files saved in `results/backup/` take up significant space: more than 1 TB in total, which causes the host machine to run out of disk during the benchmark run.

Where in the code are these files being specified and how can I disable them? Are they necessary for anything? I would assume not.

The problem is that each file in the backup directory is a CSV concatenating all benchmark results so far, so total disk usage grows as O(N^2), where N is the number of instances being spun up (in my case, N > 20,000).

As an example:

-rw-rw-r-- 1 ubuntu ubuntu 108684498 Aug 24 17:34 results.20230824T173419.csv
-rw-rw-r-- 1 ubuntu ubuntu 108687827 Aug 24 17:34 results.20230824T173421.csv
-rw-rw-r-- 1 ubuntu ubuntu 108690007 Aug 24 17:34 results.20230824T173440.csv
-rw-rw-r-- 1 ubuntu ubuntu 108694343 Aug 24 17:34 results.20230824T173442.csv
-rw-rw-r-- 1 ubuntu ubuntu 108696534 Aug 24 17:34 results.20230824T173445.csv
-rw-rw-r-- 1 ubuntu ubuntu 108700835 Aug 24 17:34 results.20230824T173447.csv
-rw-rw-r-- 1 ubuntu ubuntu 108702942 Aug 24 17:34 results.20230824T173451.csv
-rw-rw-r-- 1 ubuntu ubuntu 108705127 Aug 24 17:35 results.20230824T173500.csv
-rw-rw-r-- 1 ubuntu ubuntu 108709478 Aug 24 17:35 results.20230824T173506.csv
-rw-rw-r-- 1 ubuntu ubuntu 108711667 Aug 24 17:35 results.20230824T173509.csv
-rw-rw-r-- 1 ubuntu ubuntu 108715990 Aug 24 17:35 results.20230824T173512.csv
-rw-rw-r-- 1 ubuntu ubuntu 108718171 Aug 24 17:35 results.20230824T173516.csv
-rw-rw-r-- 1 ubuntu ubuntu 108720361 Aug 24 17:35 results.20230824T173521.csv
-rw-rw-r-- 1 ubuntu ubuntu 108722544 Aug 24 17:35 results.20230824T173524.csv
-rw-rw-r-- 1 ubuntu ubuntu 108724739 Aug 24 17:35 results.20230824T173526.csv
-rw-rw-r-- 1 ubuntu ubuntu 108726929 Aug 24 17:35 results.20230824T173528.csv
-rw-rw-r-- 1 ubuntu ubuntu 108729124 Aug 24 17:35 results.20230824T173531.csv
-rw-rw-r-- 1 ubuntu ubuntu 108731314 Aug 24 17:35 results.20230824T173546.csv
-rw-rw-r-- 1 ubuntu ubuntu 108733514 Aug 24 17:35 results.20230824T173549.csv
-rw-rw-r-- 1 ubuntu ubuntu 108735701 Aug 24 17:36 results.20230824T173558.csv
-rw-rw-r-- 1 ubuntu ubuntu 108737904 Aug 24 17:36 results.20230824T173610.csv
-rw-rw-r-- 1 ubuntu ubuntu 108740102 Aug 24 17:36 results.20230824T173627.csv
-rw-rw-r-- 1 ubuntu ubuntu 108743879 Aug 24 17:36 results.20230824T173633.csv
-rw-rw-r-- 1 ubuntu ubuntu 108746069 Aug 24 17:36 results.20230824T173635.csv
-rw-rw-r-- 1 ubuntu ubuntu 108748269 Aug 24 17:36 results.20230824T173638.csv
-rw-rw-r-- 1 ubuntu ubuntu 108752563 Aug 24 17:36 results.20230824T173643.csv

Around 10 of these files are written per minute, each larger than the last (currently about 108 MB per file), so roughly 1 GB of disk space is consumed per minute.
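
To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes one full snapshot is written per finished result and that each result adds about 2,200 bytes, which is roughly the size step between consecutive files in the listing above. Both function names are made up for illustration and are not part of the benchmark code.

```python
def total_backup_bytes(n_results: int, bytes_per_result: int) -> int:
    """Total size of all snapshots when a full copy is written after every result.

    The k-th snapshot holds k results, so the total is
    bytes_per_result * (1 + 2 + ... + n) = bytes_per_result * n * (n + 1) / 2.
    """
    return bytes_per_result * n_results * (n_results + 1) // 2


def final_file_bytes(n_results: int, bytes_per_result: int) -> int:
    """Size of the single file that remains if results are appended instead."""
    return bytes_per_result * n_results


if __name__ == "__main__":
    n = 20_000   # lower bound on the instance count mentioned above
    row = 2_200  # ASSUMED bytes per result, estimated from the file size deltas
    print(f"snapshots: {total_backup_bytes(n, row) / 1e9:.1f} GB")
    print(f"append:    {final_file_bytes(n, row) / 1e6:.1f} MB")
```

The estimate ignores the CSV header and any variation in row size, so it is only meant to show the order of magnitude: hundreds of GB of snapshots versus tens of MB for a single appended file.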

PGijsbers commented 10 months ago

The backup should be made in amlb/results.py#L112, called from the Benchmark, if I am not mistaken. Having an option to call `save` with `append=True` should be all it takes.
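
As a minimal sketch of the idea (not the actual code in amlb/results.py), the difference between the two save modes could look like the following. The helper `save_results`, its parameters, and the backup file naming are hypothetical and only illustrate why appending avoids the backup copy.

```python
import csv
import shutil
from datetime import datetime
from pathlib import Path


def save_results(rows, path, fieldnames, append=False):
    """Write result rows to a CSV file (hypothetical helper, not amlb's API).

    append=False: copy any existing file to backup/ first, then rewrite it.
    append=True:  add the new rows to the end of the file; no backup is made.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    exists = path.exists()

    if exists and not append:
        # Rewriting would destroy the old content, so keep a timestamped copy.
        backup_dir = path.parent / "backup"
        backup_dir.mkdir(exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%dT%H%M%S%f")
        shutil.copy2(path, backup_dir / f"{path.stem}.{stamp}{path.suffix}")

    mode = "a" if append else "w"
    with path.open(mode, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        # Only write the header when starting a fresh file.
        if not (append and exists):
            writer.writeheader()
        writer.writerows(rows)
```

With `append=True`, each call only adds the new rows, so disk usage grows linearly with the number of results rather than quadratically.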

In the meantime, you could disable `results.global_save`. Then no `results/results.csv` will be written at all, which should also mean no backup is made.