
FIBAD - Framework for Image-Based Anomaly Detection
MIT License

Loader should expose a snapshot dimension interface #36

Closed mtauraso closed 1 month ago

mtauraso commented 2 months ago

The idea here is that when a researcher is loading HSC cutouts from a directory with significantly heterogeneous image sizes, there should be some access to the file-scan data that the HSC data loader gathers.

This access should, in theory, let the researcher determine an appropriate cutout size to give the data loader. See: https://github.com/lincc-frameworks/fibad/issues/35

mtauraso commented 2 months ago

It is pretty easy to add this to the dataset or data loader as of PR https://github.com/lincc-frameworks/fibad/pull/49. The data members that have size information are already on the dataset object, and if you have the data loader, it's easy to get the dataset. We can pretty easily add syntactic-sugar pass-through methods so that the dataset/data loader distinction isn't something the user has to think about.
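A minimal sketch of what such a pass-through might look like (class and attribute names here are hypothetical stand-ins, not the actual fibad API):

```python
class HSCDataSet:
    """Stand-in for the real dataset; pretends a file scan already ran."""
    def __init__(self):
        # Hypothetical: per-file (width, height) gathered during the scan.
        self.image_dims = [(250, 250), (262, 262), (250, 250)]

class DataLoader:
    """Wraps a dataset; forwards size queries so the user never needs
    to know which object actually holds the scan data."""
    def __init__(self, dataset):
        self.dataset = dataset

    @property
    def image_dims(self):
        # Syntactic-sugar pass-through: delegate to the underlying dataset.
        return self.dataset.image_dims

loader = DataLoader(HSCDataSet())
sizes = loader.image_dims  # same data, reached via the loader
```

The property keeps the call site identical whether the user holds the dataset or the loader.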

Unfortunately, getting access to the data loader from the top level is a bit involved. Fibad.train (or .predict) runs a function that is essentially a script; the data loader is a local variable inside that run() function and goes out of scope afterward.

I think in order to interrogate snapshot dimensions, @aritraghsh09 (or another notebook user) is going to run something like this:

from matplotlib import pyplot

myfibad = Fibad(config="stuff.toml")
myfibad.download()
histogram, buckets = myfibad.imageSizeHistogram() # or similar
pyplot.stairs(histogram, buckets)  # plot pre-binned counts against bin edges

Note that Fibad.train(), which initializes the data loader, hasn't been run at all. Even if we did initialize a data loader in imageSizeHistogram() (or whatever interface method), the dataset/data loader we create won't survive after that call, so the expensive file scan it does will need to be redone by .train() later on. There's a similar duplication of effort if you run Fibad.train() then Fibad.predict().

I'm thinking the best way forward is to create fields on Fibad to hold the current model and data loader, and then move the train/predict/download verbs inside the Fibad class, replacing the stubs that exist there now, so they have easy access to those fields.
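One way that shape could look (a sketch with hypothetical names, not the existing code): the Fibad object lazily builds the data loader once and the verb methods all reuse it, so the file scan happens at most one time per object.

```python
class Fibad:
    """Hypothetical sketch: verbs are methods, state is cached on the object."""
    def __init__(self, config):
        self.config = config
        self.model = None
        self._data_loader = None  # survives across verb calls

    def _get_data_loader(self):
        # Build on first use; train()/predict()/imageSizeHistogram()
        # all share this, so the expensive file scan runs at most once.
        if self._data_loader is None:
            self._data_loader = self._build_data_loader()
        return self._data_loader

    def _build_data_loader(self):
        # Stand-in for the real (expensive) file scan + loader setup.
        return {"scanned": True}

    def train(self):
        return self._get_data_loader()

    def predict(self):
        return self._get_data_loader()  # no second file scan
```

Calling train() then predict() on the same object would hand back the same cached loader instead of rescanning.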

This wouldn't change the functionality of the CLI at all, since it already routes calls through the Fibad object, but it would cut out one of our earlier design ideas: that fibad "verbs" correspond to a file with a run() method. Verbs would now correspond to named methods on the Fibad object.

@drewoldag @aritraghsh09 Thoughts?

drewoldag commented 2 months ago

I see where you're coming from with this; it does make sense to avoid the expensive file-scan calls. But I'm not confident that pulling the definitions of the verb scripts into the Fibad object is the best way to maintain access to the instantiated data loader. I think I would be more in favor of being able to rapidly instantiate a data loader.

Would it be possible to cache a metadata file to skip the file scan and enable analysis outside of fibad? I think this is vaguely similar to what @aritraghsh09 was mentioning at our last sync about getting the filenames appended to the catalog table. The cached metadata isn't exactly what Aritra was asking for, but it could augment that and help the user understand the data set. Similar to what you called out in your last comment, the user might do this:

from matplotlib import pyplot

myfibad = Fibad(config="stuff.toml")
myfibad.download()  # once download completes, a metadata file for the dataset is created

# just take a look at the metadata
metadata = myfibad.get_dataset_metadata()
histo, buckets = myfibad.imageSizeHistogram()  # loads and reads from the metadata file
pyplot.stairs(histo, buckets)  # plot pre-binned counts against bin edges

# optionally run the following to recreate the metadata if, for instance, config params have changed
metadata = myfibad.get_dataset_metadata(reanalyze=True)
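Under the hood, the caching step could be as simple as writing the scan results to a sidecar file next to the data; this is a sketch under assumed names (the file name, JSON format, and scan_files helper are all guesses, nothing here is decided):

```python
import json
from pathlib import Path

METADATA_FILE = "fibad_dataset_metadata.json"  # hypothetical name

def scan_files(data_dir):
    # Stand-in for the real per-file dimension scan of the cutout directory.
    return {"image_dims": [[250, 250], [262, 262]]}

def get_dataset_metadata(data_dir, reanalyze=False):
    """Return cached scan results, rescanning only when missing or asked."""
    cache = Path(data_dir) / METADATA_FILE
    if cache.exists() and not reanalyze:
        return json.loads(cache.read_text())
    metadata = scan_files(data_dir)  # the expensive part
    cache.write_text(json.dumps(metadata))
    return metadata
```

A sidecar file like this would also let tools outside fibad read the scan results directly.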

Additionally, a cached metadata file could support something similar to what Hugging Face offers for browsing dataset statistics, which might also be a nice thing to have down the line. For example: https://huggingface.co/datasets/mwalmsley/gz_desi/viewer/default/train

Second point: having the Fibad object dispatch to externally defined download/train/predict run() functions leaves the door open for alternative implementations of those functions. In the same way that we allow plug-in models and data loaders, we could imagine a specific, non-default train/predict/etc. that a user would need. Supporting multiple ML frameworks may require different train/predict implementations, for instance.
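That externally-pluggable dispatch could keep roughly its current shape with a verb registry; this is a hypothetical sketch of the pattern, not what fibad does today:

```python
# Hypothetical verb registry: the Fibad object looks up an implementation
# by name, so a plug-in can register an alternative train/predict.
VERB_REGISTRY = {}

def register_verb(name):
    def wrapper(func):
        VERB_REGISTRY[name] = func
        return func
    return wrapper

@register_verb("train")
def default_train(config):
    return f"default train with {config}"

# A plug-in (e.g. for a different ML framework) registers its own verb:
@register_verb("train_lightning")
def lightning_train(config):
    return f"lightning train with {config}"

def dispatch(verb, config):
    # What a Fibad method or the CLI would call to run a verb by name.
    return VERB_REGISTRY[verb](config)
```

The registry keeps the "verbs are swappable" property while still letting the Fibad object own the cached state.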

mtauraso commented 2 months ago

OK, I'm going to try @drewoldag's suggestion here and pull this into the download manifest/fail-file work, in order to make roughly the notebook vision above a reality without redundant file-system scans.