Open Innixma opened 4 days ago
I agree, though I would not advocate for different behavior on S3 vs locally on the AMLB level.
I would expect output_subdir(config)
to return the either benchmark/task/fold
directory (e.g., autogluon.test.test.local.20241110T031036/iris/0/
) or a dedicated subdirectory (e.g.,autogluon.test.test.local.20241110T031036/iris/0/output/
) and let the integration script organize their output how they want to from there. Personally, I would lean towards the latter, since that makes a clearer distinction between "custom" output specific to the integration, and the AMLB stored files.
I imagine usingautogluon.test.test.local.20241110T031036/iris/0/output/
over autogluon.test.test.local.20241110T031036/iris/0/
doesn't really complicate anything?
Yeah, I have no issues with the output
directory, it is fine to stay.
Re AWS mode logic: I don't mind too much the redundant path information so long as it is changed to the .../task/fold/output/*
format. It would ultimately be best if it didn't have a redundant part, but I can live with it because my logic already handles it.
s3://automl-benchmark-ag/
ec2/
autogluon_ag_1h8c_aws_20200822T032836/
aws_ag_1h8c_adult_0_autogluon/ # <--- Redundant?
...
As a follow-up, would it break AMLB's logic if I send a PR changing to .../task/fold/output/*
specifically for AutoGluon? Or would all of the frameworks need to be updated at the same time?
Really busy so I can't look around right now. From the top of my head, I can't think of a reason to keep the extra aws_ag_1h8c_adult_0_autogluon
. I imagine it's an artifact from the fact that each EC2 instance does its own benchmark invocation, but I am not sure. It's fine to get rid of it.
I would want the change to happen for all frameworks though, otherwise it just creates more work later and might lead to unexpected issues.
Currently if I create an output_dir for a given artifact in a given task, such as
iris
on fold 0, it produces what I consider a suboptimal path. This has led me for the past 3 years to have a fork of AMLB that has different saving logic for AutoGluon so the S3 folder format is easier to work with.Current Logic
Output:
Shortening for brevity:
Then if I want to save the file, it makes it even longer:
Current AWS Mode Logic
As an example, when the files are saved to S3, parsing becomes very complicated with the logic in mainline. If I simply want to concatenate all
leaderboard.csv
files in a benchmark run, while adding additionaldataset
,fold
andmethod
columns to differentiate between the runs, this becomes well over 200 lines of very complicated, very slow code.Here is a real example of a path I have in s3 using the mainline logic:
Breaking it down:
Writing logic given the run properties to find the right files is challenging given this format. The relative path of
leaderboard.csv
toresults.csv
for a given task result requires knowledge of the dataset and fold name, which isn't ideal. Especially when the dataset name in S3 can differ from what it is in thetask_metadata.csv
due to special characters.Proposal
Instead, I recommend the final output locations of the artifacts be:
This is much nicer IMO: It makes all of the artifacts related to a given method_task run (
autogluon.test.test.local.20241110T031036/iris/0/
) available in the same directory. It then becomes very easy to tell what artifacts are available from the run, rather then them being spread across many folders.AWS Mode
For AWS mode, I propose that the artifact save format should be:
(optionally can also remove the output/ dir if it is redundant)
This would make the relative path of all artifacts identical for all tasks, since they live in the same directory with the same structure regardless of the task name / fold / method.