openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License
409 stars 135 forks source link

Make output artifacts live in the task hierarchy they were produced in #647

Open Innixma opened 2 weeks ago

Innixma commented 2 weeks ago

Currently if I create an output_dir for a given artifact in a given task, such as iris on fold 0, it produces what I consider a suboptimal path. This has led me for the past 3 years to have a fork of AMLB that has different saving logic for AutoGluon so the S3 folder format is easier to work with.

Current Logic

from frameworks.shared.callee import output_subdir
leaderboard_output_directory = output_subdir("leaderboard", config)
info_output_directory = output_subdir("info", config)

Output:

/home/ubuntu/workspace/code/automlbenchmark/results/autogluon.test.test.local.20241110T031036/leaderboard/iris/0
/home/ubuntu/workspace/code/automlbenchmark/results/autogluon.test.test.local.20241110T031036/info/iris/0

Shortening for brevity:

autogluon.test.test.local.20241110T031036/leaderboard/iris/0
autogluon.test.test.local.20241110T031036/info/iris/0

Then if I want to save the file, it makes it even longer:

autogluon.test.test.local.20241110T031036/leaderboard/iris/0/leaderboard.csv
autogluon.test.test.local.20241110T031036/info/iris/0/info.json

Current AWS Mode Logic

As an example, when the files are saved to S3, parsing becomes very complicated with the logic in mainline. If I simply want to concatenate all leaderboard.csv files in a benchmark run, while adding additional dataset, fold and method columns to differentiate between the runs, this becomes well over 200 lines of very complicated, very slow code.

Here is a real example of a path I have in s3 using the mainline logic:

s3://automl-benchmark-ag/ec2/2020_08_21/autogluon_ag_1h8c_aws_20200822T032836/aws_ag_1h8c_adult_0_autogluon/output/leaderboard/adult/0/leaderboard.csv

Breaking it down:

s3://automl-benchmark-ag/
ec2/
2020_08_21/
autogluon_ag_1h8c_aws_20200822T032836/
aws_ag_1h8c_adult_0_autogluon/
output/
leaderboard/
adult/
0/
leaderboard.csv

Writing logic given the run properties to find the right files is challenging given this format. The relative path of leaderboard.csv to results.csv for a given task result requires knowledge of the dataset and fold name, which isn't ideal. Especially when the dataset name in S3 can differ from what it is in the task_metadata.csv due to special characters.

Proposal

Instead, I recommend the final output locations of the artifacts be:

autogluon.test.test.local.20241110T031036/iris/0/leaderboard.csv
autogluon.test.test.local.20241110T031036/iris/0/info.json

This is much nicer IMO: It makes all of the artifacts related to a given method_task run (autogluon.test.test.local.20241110T031036/iris/0/) available in the same directory. It then becomes very easy to tell what artifacts are available from the run, rather then them being spread across many folders.

AWS Mode

For AWS mode, I propose that the artifact save format should be:

s3://automl-benchmark-ag/
ec2/
2020_08_21/
autogluon_ag_1h8c_aws_20200822T032836/
adult/
0/
output/
leaderboard.csv

(optionally can also remove the output/ dir if it is redundant)

This would make the relative path of all artifacts identical for all tasks, since they live in the same directory with the same structure regardless of the task name / fold / method.

PGijsbers commented 2 weeks ago

I agree, though I would not advocate for different behavior on S3 vs locally on the AMLB level.

I would expect output_subdir(config) to return the either benchmark/task/folddirectory (e.g., autogluon.test.test.local.20241110T031036/iris/0/) or a dedicated subdirectory (e.g.,autogluon.test.test.local.20241110T031036/iris/0/output/) and let the integration script organize their output how they want to from there. Personally, I would lean towards the latter, since that makes a clearer distinction between "custom" output specific to the integration, and the AMLB stored files.

I imagine usingautogluon.test.test.local.20241110T031036/iris/0/output/ over autogluon.test.test.local.20241110T031036/iris/0/ doesn't really complicate anything?

Innixma commented 2 weeks ago

Yeah, I have no issues with the output directory, it is fine to stay.

Re AWS mode logic: I don't mind too much the redundant path information so long as it is changed to the .../task/fold/output/* format. It would ultimately be best if it didn't have a redundant part, but I can live with it because my logic already handles it.

s3://automl-benchmark-ag/
ec2/
autogluon_ag_1h8c_aws_20200822T032836/
aws_ag_1h8c_adult_0_autogluon/  # <--- Redundant?
...

As a follow-up, would it break AMLB's logic if I send a PR changing to .../task/fold/output/* specifically for AutoGluon? Or would all of the frameworks need to be updated at the same time?

PGijsbers commented 2 weeks ago

Really busy so I can't look around right now. From the top of my head, I can't think of a reason to keep the extra aws_ag_1h8c_adult_0_autogluon. I imagine it's an artifact from the fact that each EC2 instance does its own benchmark invocation, but I am not sure. It's fine to get rid of it.

I would want the change to happen for all frameworks though, otherwise it just creates more work later and might lead to unexpected issues.