kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

[SDK] Use HuggingFace Data Collator for more Transformers in LLM Trainer #2032

Open · andreyvelich opened this issue 4 months ago

andreyvelich commented 4 months ago

More context: https://github.com/kubeflow/training-operator/pull/2031#discussion_r1526533371. Currently, we apply HuggingFace Data Collator only for AutoModelForCausalLM Transformer in HF LLM Trainer.

We need to investigate whether we should apply it to other Transformers for language modeling.
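
For context, here is a minimal sketch of what that current behavior means in HuggingFace terms. This mirrors the general pattern, not the Trainer's actual code, and "gpt2" is a placeholder model name used only for illustration:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# "gpt2" is a placeholder model name for illustration only.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# For causal LM the collator is created with mlm=False: labels are the
# input ids (shifted inside the model) and padding is applied per batch.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```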

live2awesome commented 4 months ago

I am interested in contributing to this. I will ping this thread if any help is required.

/assign

live2awesome commented 4 months ago

What types of Transformers are we looking at? I have looked into it, and a Data Collator can be used for the Transformer models below (see the sketch after this list):

  1. AutoModelForMaskedLM - DataCollatorForLanguageModeling with mlm=True
  2. AutoModelForSeq2SeqLM - DataCollatorForSeq2Seq
  3. AutoModelForTokenClassification - DataCollatorForTokenClassification
  4. AutoModelForSequenceClassification - simple padding is sufficient

There are also options for permutation language modeling and whole-word masking. Kindly suggest, @andreyvelich @johnugeorge
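
A hedged sketch of how each pairing above could be instantiated; the tokenizer checkpoint is a placeholder, and only the collator classes come from the list:

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorForSeq2Seq,
    DataCollatorForTokenClassification,
    DataCollatorWithPadding,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder

# 1. Masked LM: randomly masks tokens at batch-construction time.
mlm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

# 2. Seq2Seq: pads label sequences as well as inputs.
seq2seq_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)

# 3. Token classification: pads per-token label sequences alongside inputs.
token_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# 4. Sequence classification: dynamic padding only.
padding_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Also available in transformers: DataCollatorForWholeWordMask and
# DataCollatorForPermutationLanguageModeling for the variants mentioned above.
```
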
andreyvelich commented 3 months ago

Thank you for your interest @live2awesome! It would be nice if you could let us know what changes we need to make to our HF LLM Trainer to support Data Collators for other Transformers. Also, we should discuss if we should add Data Collator by default to all supported transformers.
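
One possible shape for that change, purely as a sketch: the `COLLATOR_BY_TRANSFORMER` table and `get_data_collator` helper below are hypothetical names, not existing Trainer code. Unmapped types fall back to plain padding (None), which matches point 4 in the list above, and the current causal-LM behavior is preserved as the mlm=False entry:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoModelForSeq2SeqLM,
    AutoModelForTokenClassification,
    DataCollatorForLanguageModeling,
    DataCollatorForSeq2Seq,
    DataCollatorForTokenClassification,
)

# Hypothetical dispatch table: the Trainer could pick a collator from the
# user's transformer type instead of hard-coding the causal-LM case.
COLLATOR_BY_TRANSFORMER = {
    AutoModelForCausalLM: lambda tok: DataCollatorForLanguageModeling(tok, mlm=False),
    AutoModelForMaskedLM: lambda tok: DataCollatorForLanguageModeling(tok, mlm=True),
    AutoModelForSeq2SeqLM: lambda tok: DataCollatorForSeq2Seq(tok),
    AutoModelForTokenClassification: lambda tok: DataCollatorForTokenClassification(tok),
}

def get_data_collator(transformer_type, tokenizer):
    """Return a collator for the given AutoModel class, or None (plain padding)."""
    factory = COLLATOR_BY_TRANSFORMER.get(transformer_type)
    return factory(tokenizer) if factory else None
```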

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 3 weeks ago

/remove-lifecycle stale

andreyvelich commented 3 weeks ago

/lifecycle frozen