mlcommons / GaNDLF

A generalizable application framework for segmentation, regression, and classification using PyTorch
https://gandlf.org
Apache License 2.0

Added huggingface integration #892

Open pranayasinghcsmpl opened 2 weeks ago

pranayasinghcsmpl commented 2 weeks ago

Fixes #727

Proposed Changes

Checklist

github-actions[bot] commented 2 weeks ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

sarthakpati commented 2 weeks ago

I am assuming this PR makes #880 redundant.

sarthakpati commented 2 weeks ago

This needs fairly robust documentation regarding usage, including references to the various HF commands used internally [ref]. I would recommend creating a separate subsection called "Deploying Models to HuggingFace Hub" under the deployment header of the usage documentation and then adding details there. It would be awesome if you could showcase an example in the form of a tutorial as well!

pranayasinghcsmpl commented 2 weeks ago

Sure, I'll get started on that.

pranayasinghcsmpl commented 2 weeks ago

> I am assuming this PR makes #880 redundant.

Yeah, it does.

sarthakpati commented 4 days ago

Hi @Wauplin @NielsRogge, would you mind taking a look at this PR to see if it makes sense from the HF perspective?

Wauplin commented 4 days ago

Hi @sarthakpati, thanks for the ping. I've reviewed this PR and had a look at https://github.com/mlcommons/GaNDLF/issues/727, https://github.com/mlcommons/GaNDLF/pull/880, and https://github.com/mlcommons/GaNDLF/pull/878. To be honest, I'm not sure I understand the added value of wrapping huggingface-cli. From what I understand, users will type gandlf hf download ... instead of huggingface-cli download ..., but the behavior will be exactly the same? If that's the case, I don't think it's worth adding: the maintenance cost (e.g. adding new commands whenever huggingface_hub gets updated) seems to outweigh the benefits in my opinion. If gandlf usage is purely CLI-based, I understand the need for a CLI. In that case, I would either:

Overall, I'm new to the GaNDLF codebase and don't want to assume how users are using it. But if you have some script-based usage (e.g. users importing GaNDLF classes in their code), then I think a tight integration as suggested in https://github.com/mlcommons/GaNDLF/pull/878 by @nielsrogge makes more sense. Being able to load/save models on the HF Hub directly is usually a much-appreciated feature.

sarthakpati commented 3 days ago

Hi @Wauplin,

Thank you for your feedback. I think the second option you suggest makes a lot of sense, since we definitely need the added metadata. At the very least, we need the version of GaNDLF the model was trained with and (if applicable) the accompanying git hash. And having gandlf as library_name would be perfect.

@pranayasinghcsmpl - does this make sense?

Wauplin commented 3 days ago

Agree!

Btw, the PR from @nielsrogge https://github.com/mlcommons/GaNDLF/pull/878/files would help with that. In the Mixin, you can define metadata that will be added to the model card. Here is an integration guide about it: https://huggingface.co/docs/huggingface_hub/v0.24.0.rc0/en/guides/integrations#model-card. Once that's integrated, the gandlf upload and gandlf download commands will simply be wrappers around the .from_pretrained and .push_to_hub methods!
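
For concreteness, here is a minimal sketch of what such a mixin-based integration could look like with a recent huggingface_hub; the class name, metadata values, and repo id are illustrative, not part of this PR:

```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin  # requires huggingface_hub >= 0.22


# Hypothetical example: inheriting from PyTorchModelHubMixin adds
# .from_pretrained() / .push_to_hub() to the class and lets us declare
# model-card metadata (library_name, tags, ...) at class-definition time.
class GandlfDemoModel(
    nn.Module,
    PyTorchModelHubMixin,
    library_name="gandlf",            # shows up in the generated model card
    tags=["gandlf", "segmentation"],  # illustrative tags
    repo_url="https://github.com/mlcommons/GaNDLF",
):
    def __init__(self, in_channels: int = 1, num_classes: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)


# model = GandlfDemoModel()
# model.push_to_hub("my-org/gandlf-demo")                    # weights + card
# model = GandlfDemoModel.from_pretrained("my-org/gandlf-demo")
```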

pranayasinghcsmpl commented 3 days ago

Hi @Wauplin, the GaNDLF CLI exports other files apart from just the model, such as logs, CSVs, ONNX files, etc. The GaNDLF CLI has various commands to perform different stages of model development. So if a user uses gandlf run to train a model, they can then use gandlf collect-stats to perform model evaluation, which will generate more files containing plots and stats. Also, to run inference on a GaNDLF model, the user requires the config file that was used during training and all the files exported after training the model.

Therefore, it is important to be able to upload many different files generated by different GaNDLF commands. Along with that, the user might want to include the training data that was used.

If we can upload all these files using the Mixin, we can just proceed with your suggestions.

sarthakpati commented 3 days ago

@pranayasinghcsmpl - I am unsure if uploading all the files to HF is required. I would imagine the model files (*.pth.tar, *.onnx, *.xml, *.bin) would be the most important, along with the model configuration itself. These, combined with the metadata needed for the HF Hub to parse the uploads as being GaNDLF-centric (using the library_name key), should suffice.

Wauplin commented 3 days ago

Thanks for the broader context @pranayasinghcsmpl! Given all the artifacts, it's indeed a bit more complex, and the mixin won't solve it all. In that case, the best way to build a gandlf hf upload command is to use upload_folder on the folder with all the artifacts. Before that, you can generate a model card locally (with library_name) using the ModelCard class. You can define a set of allow_patterns/ignore_patterns rules to tell which files you want to select. The same goes for gandlf hf download: using snapshot_download with some allow_patterns/ignore_patterns would suffice.
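
Sketched out, that flow could look roughly like this; the repo id, folder path, file patterns, and card content are placeholders, not the final GaNDLF behavior (this also assumes you are already authenticated, e.g. via huggingface-cli login):

```python
from huggingface_hub import ModelCard, ModelCardData, snapshot_download, upload_folder

MODEL_DIR = "outputs/my_model"  # hypothetical GaNDLF output folder
REPO_ID = "my-org/gandlf-demo"  # placeholder repo id

# 1. Generate a model card with GaNDLF metadata in its YAML header.
card_data = ModelCardData(library_name="gandlf", license="apache-2.0")
card = ModelCard(f"---\n{card_data.to_yaml()}\n---\n\n# GaNDLF model\n")
card.save(f"{MODEL_DIR}/README.md")

# 2. Upload the whole artifact folder, filtering with patterns.
upload_folder(
    repo_id=REPO_ID,
    folder_path=MODEL_DIR,
    allow_patterns=["*.pth.tar", "*.onnx", "*.yaml", "README.md"],
    ignore_patterns=["*.log", "*.csv"],  # e.g. skip logs and stats
)

# 3. Download: mirror the same pattern-based filtering.
local_dir = snapshot_download(repo_id=REPO_ID, allow_patterns=["*.pth.tar", "*.yaml"])
```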

Let me know if you have any questions :hugs:

pranayasinghcsmpl commented 3 days ago

Hi @Wauplin, thanks for the suggestions. We'll work on it and let you know if we face any issues.

pranayasinghcsmpl commented 3 days ago

@sarthakpati - you're right, the most important files are the model folder files and the config file. The user will have to move the config file to the model folder and then upload it to Hugging Face. We can also discard unnecessary files using ignore_patterns and set library_name in the model card, as @Wauplin suggested.

pranayasinghcsmpl commented 3 days ago

@Wauplin, is there any predefined way to set the version of the model in a model repo?

Wauplin commented 3 days ago

@pranayasinghcsmpl not really, no. Some labs do one version of the model == one repo on the Hub and use a convention in the repo name (e.g. the "3" in meta-llama/Meta-Llama-3-8B). The advantage is differentiating usage, discussions, analytics, etc. between model versions.

Another possibility is to upload different versions to the same repo and use git tags to version the model; you can use create_tag for that. The advantage is keeping everything in the same place, especially the model card and discussions. The second solution is more suitable if, every time there is a new version, you expect users to switch to it instead of having usage spread across both.
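
For the tag-based approach, a minimal sketch (the repo id and tag name are placeholders):

```python
from huggingface_hub import create_tag, snapshot_download

REPO_ID = "my-org/gandlf-demo"  # placeholder

# Tag the current state of the repo's main branch as a named version.
create_tag(REPO_ID, tag="v1.0", tag_message="First GaNDLF release")

# Later, consumers can pin their download to that exact version.
local_dir = snapshot_download(REPO_ID, revision="v1.0")
```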

pranayasinghcsmpl commented 3 days ago

Ok, thanks

sarthakpati commented 2 days ago

> The user will have to move the config file to the model folder

@pranayasinghcsmpl: It would be awesome if this can be automated when the user initiates the gandlf hf upload command... 😄

Thanks for all the input @Wauplin!

pranayasinghcsmpl commented 2 days ago

@sarthakpati - Maybe we could save a copy of the config file in the model folder during training. There might be a better way to solve this; let me look into it.
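
Either way, the staging step itself is small. A sketch of a hypothetical upload helper that copies the config in before pushing (the function name and paths are illustrative, not an existing GaNDLF API):

```python
import shutil
from pathlib import Path

from huggingface_hub import upload_folder


def gandlf_hf_upload(model_dir: str, config_path: str, repo_id: str) -> None:
    """Hypothetical helper: stage the training config next to the model
    artifacts, then push the folder to the Hub in one call."""
    model_path = Path(model_dir)
    # Copy rather than move so the original run directory stays intact.
    shutil.copy2(config_path, model_path / Path(config_path).name)
    upload_folder(repo_id=repo_id, folder_path=str(model_path))
```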

pranayasinghcsmpl commented 2 days ago

@sarthakpati, could we maybe convert the parameters.pkl into a YAML format? It seems to have a lot of values that are in the config file, and some that aren't.

sarthakpati commented 2 days ago

> @sarthakpati, could we maybe convert the parameters.pkl into a YAML format? It seems to have a lot of values that are in the config file, and some that aren't.

Sure, but please add the YAML file in addition to the pkl (for backward-compatibility reasons).
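
A minimal sketch of that conversion, assuming the pickled parameters are plain YAML-serializable values (the path and function name are illustrative):

```python
import pickle
from pathlib import Path

import yaml


def export_parameters_yaml(model_dir: str) -> None:
    """Write parameters.yaml alongside parameters.pkl, keeping the pkl
    untouched for backward compatibility."""
    pkl_path = Path(model_dir) / "parameters.pkl"
    with open(pkl_path, "rb") as f:
        parameters = pickle.load(f)
    # safe_dump only handles plain Python types; anything exotic in the
    # parameters dict would need custom representers or pre-conversion.
    with open(pkl_path.with_suffix(".yaml"), "w") as f:
        yaml.safe_dump(parameters, f, default_flow_style=False)
```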

pranayasinghcsmpl commented 2 days ago

ok, sure