Open pranayasinghcsmpl opened 2 weeks ago
I am assuming this PR makes #880 redundant.
This needs fairly robust documentation regarding usage, including references to the various HF commands used internally [ref]. I would recommend creating a separate subsection called "Deploying Models to the Hugging Face Hub" under the deployment header of the usage docs, and then adding details there. It would be awesome if you could showcase an example in the form of a tutorial as well!
Sure, I'll get started on that.
> I am assuming this PR makes #880 redundant.
yeah, it does
Hi @Wauplin @NielsRogge, would you mind taking a look at this PR to see if it makes sense from the HF perspective?
Hi @sarthakpati, thanks for the ping. I've reviewed this PR and had a look at https://github.com/mlcommons/GaNDLF/issues/727, https://github.com/mlcommons/GaNDLF/pull/880, and https://github.com/mlcommons/GaNDLF/pull/878. To be honest, I'm not sure I understand the added value of wrapping `huggingface-cli`. From what I understand, users will type `gandlf hf download ...` instead of `huggingface-cli download ...`, but then the behavior will be exactly the same? If that's the case, I don't think it's worth adding it. The cost of maintenance (e.g. adding new commands when `huggingface_hub` gets updated) seems to outweigh the benefits in my opinion. If `gandlf` usage is purely CLI-based, I understand the need for a CLI. In that case, I would either:

- document `huggingface-cli upload` and `huggingface-cli download` in the documentation, without wrapping them; or
- keep `gandlf hf download` and `gandlf hf upload`, but only if additional features are added to them. For example, default values for model ids depending on the GaNDLF model to load, or making sure in `gandlf hf upload` that a model card is created for the trained model which contains some information about the training plus some metadata. Metadata is important for discoverability on the Hub, say for example to filter all models with `gandlf` as `library_name`: https://huggingface.co/models?library=gandlf.

Overall, I'm new to the GaNDLF codebase and don't want to assume how users are using it. But if you have some script-based usage (e.g. users importing GaNDLF classes in their code), then I think that a tight integration as suggested in https://github.com/mlcommons/GaNDLF/pull/878 by @nielsrogge makes more sense. Being able to load / save models on the HF Hub directly is usually a much appreciated feature.
Hi @Wauplin,
Thank you for your feedback. I think the second option you suggest makes a lot of sense, since we definitely need to have added metadata. At the very least, we need to have the version of GaNDLF the model was trained on and (if applicable) the accompanying git hash. And having `gandlf` as `library_name` would be perfect.
@pranayasinghcsmpl - does this make sense?
Agree!
Btw, the PR from @nielsrogge https://github.com/mlcommons/GaNDLF/pull/878/files would help doing that. In the Mixin, you can define metadata that will be added to the model card. Here is an integration guide about it: https://huggingface.co/docs/huggingface_hub/v0.24.0.rc0/en/guides/integrations#model-card. Once that's integrated, the `gandlf upload` and `gandlf download` commands will simply be wrappers around the `from_pretrained` and `push_to_hub` methods!
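To make the metadata point above concrete, here is a minimal sketch (not GaNDLF's actual implementation) of a model card front matter with `library_name: gandlf`, using the `ModelCardData` class from `huggingface_hub`. The tags and the `gandlf_version` key are assumptions about what GaNDLF might record, not existing conventions:

```python
# Sketch: build model-card metadata so the Hub can filter GaNDLF models.
# Extra keys (gandlf_version) and tags are hypothetical placeholders.
from huggingface_hub import ModelCardData

card_data = ModelCardData(
    library_name="gandlf",               # enables the ?library=gandlf filter
    tags=["gandlf", "medical-imaging"],  # hypothetical tags
    gandlf_version="X.Y.Z",              # placeholder for the training version
)
print(card_data.to_yaml())
```

The YAML this prints is what would sit in the front matter of the repo's `README.md` (the model card).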
Hi @Wauplin, the GaNDLF CLI exports other files apart from just the model, like logs, CSVs, ONNX files, etc. The GaNDLF CLI has various commands to perform different stages of model development. So if a user uses `gandlf run` to train a model, they can then use `gandlf collect-stats` to perform model evaluation, which will generate more files containing plots & stats. Also, to run inference on a GaNDLF model, the user requires the config file that was used during training and all the files exported after training the model.
Therefore, it is important to be able to upload a lot of different files that are generated by different GaNDLF commands. Along with that, the user might want to include the training data that was used.
If we can upload all these files using the Mixin, we can just proceed with your suggestions.
@pranayasinghcsmpl - I am unsure if uploading all the files to HF is required. I would imagine the model files (`*.pth.tar`, `*.onnx`, `*.xml`, `*.bin`) would be the most important, along with the model configuration itself. These, combined with the metadata needed for the HF Hub to be able to parse the uploads as being `gandlf`-centric (using the `library_name` key), should suffice.
Thanks for the broader context @pranayasinghcsmpl! Given all the artifacts, it's indeed a bit more complex and the Mixin won't solve it all. Then the best way to make a `gandlf hf upload` command is to use `upload_folder` on the folder with all the artifacts. Before that, you can generate a model card locally (with `library_name`) using the `ModelCard` class. You can define a set of rules `allow_patterns`/`ignore_patterns` to tell which files you want to select. Same for `gandlf hf download`: using `snapshot_download` with some `allow_patterns`/`ignore_patterns` would suffice.
Let me know if you have any questions 🤗
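The flow described here could be sketched roughly as below. The wrapper function names, repo ids, paths, and the specific file patterns are placeholders, not GaNDLF or huggingface_hub API; only `ModelCard`, `upload_folder`, and `snapshot_download` are real `huggingface_hub` functions:

```python
# Sketch of a possible `gandlf hf upload` / `gandlf hf download` backend.
# The wrapper names and patterns below are hypothetical placeholders.
import os
from huggingface_hub import ModelCard, ModelCardData, snapshot_download, upload_folder

def gandlf_hf_upload(repo_id: str, folder: str) -> None:
    # 1. Generate a model card with GaNDLF metadata inside the artifact folder.
    card = ModelCard.from_template(ModelCardData(library_name="gandlf"))
    card.save(os.path.join(folder, "README.md"))
    # 2. Upload only the relevant artifacts (patterns are illustrative).
    upload_folder(
        repo_id=repo_id,
        folder_path=folder,
        allow_patterns=["*.pth.tar", "*.onnx", "*.xml", "*.bin", "*.yaml", "README.md"],
        ignore_patterns=["logs/*", "*.csv"],  # skip training logs / stats files
    )

def gandlf_hf_download(repo_id: str, local_dir: str) -> str:
    # Mirror of the upload side: fetch only the files needed for inference.
    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=["*.pth.tar", "*.yaml", "README.md"],
    )
```

Both calls require a valid HF token with write (for upload) or read access to the repo, so the functions are defined but not invoked here.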
Hi @Wauplin, thanks for the suggestions. We'll work on it and let you know if we face any issues.
@sarthakpati - you're right, the most important files are the model folder files & the config file. The user will have to move the config file to the model folder & then upload it to Hugging Face. We can also discard unnecessary files using `ignore_patterns` and set `library_name` in the model card like @Wauplin suggested.
@Wauplin, is there any predefined way to set the version of the model in a model repo?
@pranayasinghcsmpl not really, no. Some labs do 1 version of the model == 1 repo on the Hub and use a convention in the repo name (e.g. "3" in `meta-llama/Meta-Llama-3-8B`). The advantage is to differentiate usage, discussions, analytics, etc. between model versions. Another possibility is to upload different versions to the same repo and use git tags to version the model. You can use `create_tag` for that. The advantage is to keep everything in the same place, especially keeping the model card and discussions. The second solution is more suitable if, every time you have a new version, you expect users to switch to it instead of having usage on both.
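The second option (git tags on a single repo) could look like the sketch below. `HfApi.create_tag` is a real `huggingface_hub` method; the wrapper name, tag format, and repo id are placeholders:

```python
# Sketch: tag a model version in an existing Hub repo via a git tag.
# Requires write access to the repo; the wrapper name is hypothetical.
from huggingface_hub import HfApi

def tag_model_version(repo_id: str, version: str) -> None:
    api = HfApi()
    # By default this tags the current `main` revision, e.g. version="v0.1.0".
    api.create_tag(repo_id, tag=version, tag_message=f"GaNDLF model {version}")
```

Users could then pin a version with `revision="v0.1.0"` when downloading.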
Ok, thanks
> The user will have to move the config file to the model folder
@pranayasinghcsmpl: It would be awesome if this could be automated when the user initiates the `gandlf hf upload` command... 😄
Thanks for all the input @Wauplin!
@sarthakpati - Maybe we could save a copy of the config file in the model folder during training. There might be a better way to solve this; let me look into it.
@sarthakpati, could we maybe convert the `parameters.pkl` into a YAML format? It seems to have a lot of values that are in the config file & some that aren't.
Sure, but please add the YAML file as an addition to the pkl (for backward-compatibility reasons).
ok, sure
Fixes #727
Proposed Changes
Checklist
- The `CONTRIBUTING` guide has been followed.
- `typing` is used to provide type hints (including and not limited to using `Optional` if a variable has a pre-defined value).
- If any changes to the dependencies are made (i.e., a new `pip install` step is needed for the PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].