m2lines / ocean_emulators

https://m2lines.github.io/ocean_emulators/
Apache License 2.0

Naming Scheme for input/output files #25

Open jbusecke opened 1 week ago

jbusecke commented 1 week ago

Can we come up with a generic naming scheme for input/output file names?

What are the parameters we need to distinguish?

Input:

Output:

suryadheeshjith commented 1 week ago

I believe this name would be good: Modelname_epoch_train_dataset_eval_dataset_2D/3D.
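A minimal sketch of how such a name could be assembled, assuming the five fields are simply joined with underscores in the order proposed above. The function name `build_output_name` and the example values are hypothetical and not part of the repository.

```python
def build_output_name(model, epoch, train_dataset, eval_dataset, dims):
    """Join the proposed fields into a single output name.

    Hypothetical helper: the field order follows the scheme
    Modelname_epoch_train_dataset_eval_dataset_2D/3D.
    """
    # Only the two dimensionality labels named in the scheme are accepted.
    if dims not in ("2D", "3D"):
        raise ValueError(f"dims must be '2D' or '3D', got {dims!r}")
    # Convert every field to a string and join with underscores.
    parts = [model, epoch, train_dataset, eval_dataset, dims]
    return "_".join(str(part) for part in parts)


# Example values are placeholders, not real model or dataset names.
name = build_output_name("modelA", 50, "trainset", "evalset", "3D")
print(name)  # modelA_50_trainset_evalset_3D
```

One caveat with this sketch: if a model or dataset name itself contains an underscore, the resulting name can no longer be split back into its fields unambiguously, so a different separator may be worth considering.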

We do not require the version of preprocessing because my workflow saves the source code at the point of training. It also saves the configuration of the training (number of GPUs, machine, etc.).

jbusecke commented 1 week ago

But there could be a case where the preprocessing was run with a different version than the training, right? Shouldn't we capture both? I think as long as we have the repo + version in each dataset, and then add the name of the input dataset to the prediction, we have full provenance.

suryadheeshjith commented 1 week ago

Yepp, it actually stores the entire source code, not just the training code. But sure, we could add the hash. Could you provide a simple example of a file, just to confirm my understanding?

jbusecke commented 1 week ago

I am writing the hash of the preprocessing into the input dataset's attributes, so you could grab it from there. I'll show an example once I win the battle with dask to write out this damn dataset.
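Until that example is available, here is a rough sketch of the provenance idea discussed in this thread: read the preprocessing hash from the input dataset's attributes and carry it, together with the input dataset name and the training code version, into the prediction's attributes. A plain dict stands in for the dataset attributes (xarray's `Dataset.attrs` is dict-like). The attribute keys, the function name, and the example values are all assumptions, not the names actually used in the repository.

```python
def prediction_attrs(input_attrs, input_name, training_hash):
    """Build provenance attributes for a prediction dataset.

    Hypothetical helper: 'preprocessing_git_hash', 'input_dataset' and
    'training_git_hash' are assumed key names for illustration only.
    """
    # The preprocessing hash is expected to live in the input attributes.
    preprocessing_hash = input_attrs.get("preprocessing_git_hash")
    if preprocessing_hash is None:
        raise KeyError("input dataset has no 'preprocessing_git_hash' attribute")
    return {
        "input_dataset": input_name,                   # which input was used
        "preprocessing_git_hash": preprocessing_hash,  # version that made the input
        "training_git_hash": training_hash,            # version that trained the model
    }


# Placeholder attributes standing in for something like `ds.attrs`.
input_attrs = {"preprocessing_git_hash": "abc1234"}
attrs = prediction_attrs(input_attrs, "trainset", "def5678")
print(attrs["preprocessing_git_hash"])  # abc1234
```

With both hashes recorded next to the input dataset name, the prediction would still be traceable when preprocessing and training ran on different versions of the code.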