This repository contains Dockerfiles, scripts, YAML files, Helm charts, and related assets used to scale out AI containers with versions of TensorFlow and PyTorch that have been optimized for Intel platforms. Scaling is done with Python, Docker, Kubernetes, Kubeflow, cnvrg.io, Helm, and other container orchestration frameworks for use in the cloud and on premises.
Fixes an issue where processed DataLoaders could no longer be pickled, in #3074. Thanks @byi8220
Fixes an issue when using FSDP where default_transformers_cls_names_to_wrap would split _no_split_modules into individual characters instead of keeping it as a list of layer names, in #3075
v0.34.0: StatefulDataLoader Support, FP8 Improvements, and PyTorch Updates!
Dependency Changes
Updated Safetensors Requirement: The library now requires safetensors version 0.4.3.
Added support for NumPy 2.0: The library now fully supports numpy 2.0.0
Core
New Script Behavior Changes
Process Group Management: PyTorch now requires users to destroy process groups after training. The accelerate library will handle this automatically with accelerator.end_training(), or you can do it manually using PartialState().destroy_process_group().
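What this cleanup means at the raw PyTorch level can be sketched with a single-process group (illustration only; with accelerate you would simply call accelerator.end_training()):

```python
import os
import torch.distributed as dist

# Illustration only: a single-process "gloo" group stands in for a real
# multi-process training job.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# ... training would happen here ...

# PyTorch now expects this explicit cleanup; accelerator.end_training()
# (or PartialState().destroy_process_group()) performs it for you.
dist.destroy_process_group()
```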
MLU Device Support: Added support for saving and loading RNG states on MLU devices by @huismiling
NPU Support: Corrected backend and distributed settings when using transfer_to_npu, ensuring better performance and compatibility.
DataLoader Enhancements
Stateful DataLoader: We are excited to announce that early support has been added for the StatefulDataLoader from torchdata, allowing better handling of data loading state. Enable it by passing use_stateful_dataloader=True to the DataLoaderConfiguration; when calling load_state(), the DataLoader will automatically resume from its last step, with no more iterating through already-consumed batches.
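The idea can be sketched in plain Python (a toy stand-in, not the torchdata implementation): the loader tracks how many batches it has yielded, exposes that as state, and fast-forwards past already-consumed batches on load:

```python
class ToyStatefulLoader:
    """Toy stand-in for a stateful dataloader: resumes mid-epoch from saved state."""

    def __init__(self, data, batch_size=2):
        self.data = data
        self.batch_size = batch_size
        self._batches_yielded = 0

    def __iter__(self):
        # Skip batches that were already consumed before the checkpoint.
        start = self._batches_yielded * self.batch_size
        for i in range(start, len(self.data), self.batch_size):
            self._batches_yielded += 1
            yield self.data[i:i + self.batch_size]

    def state_dict(self):
        return {"batches_yielded": self._batches_yielded}

    def load_state_dict(self, state):
        self._batches_yielded = state["batches_yielded"]


loader = ToyStatefulLoader(list(range(8)))
it = iter(loader)
first = next(it)             # consume one batch: [0, 1]
state = loader.state_dict()  # checkpoint after one batch

resumed = ToyStatefulLoader(list(range(8)))
resumed.load_state_dict(state)
resumed_batches = list(resumed)  # picks up at [2, 3], not the beginning
```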
Decoupled Data Loader Preparation: The prepare_data_loader() function is now independent of the Accelerator, giving you more flexibility over which API level you want to use.
XLA Compatibility: Added support for skipping initial batches when using XLA.
Improved State Management: Bug fixes and enhancements for saving/loading DataLoader states, ensuring smoother training sessions.
Epoch Setting: Introduced the set_epoch function for MpDeviceLoaderWrapper.
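A set_epoch hook matters because shuffling is typically seeded per epoch; here is a minimal plain-Python sketch of the pattern (not the MpDeviceLoaderWrapper code), where the shuffle order is a pure function of (seed, epoch) so every worker agrees on it:

```python
import random


class ToyShuffledLoader:
    """Sketch of the set_epoch pattern: same (seed, epoch) => same shuffle
    order on every worker; bumping the epoch reshuffles deterministically."""

    def __init__(self, data, seed=0):
        self.data = list(data)
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        order = list(range(len(self.data)))
        # Seed derived from both the base seed and the current epoch.
        random.Random(self.seed + self.epoch).shuffle(order)
        return iter(self.data[i] for i in order)


loader = ToyShuffledLoader(range(6))
loader.set_epoch(0)
epoch0 = list(loader)
loader.set_epoch(0)
again = list(loader)   # identical order: deterministic in (seed, epoch)
loader.set_epoch(1)
epoch1 = list(loader)  # a fresh shuffle for the new epoch
```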
FP8 Training Improvements
Enhanced FP8 Training: Fully Sharded Data Parallelism (FSDP) and DeepSpeed support now work seamlessly with TransformerEngine FP8 training, including better defaults for the quantized FP8 weights.
Integration baseline: We've added a new suite of examples and benchmarks to ensure that our TransformerEngine integration works exactly as intended. These scripts run one half using 🤗 Accelerate's integration and the other half with raw TransformerEngine, providing users with a clear example of what we do under the hood with accelerate, and a good sanity check to make sure nothing breaks over time. Find them here
Import Fixes: Resolved issues with import checks for TransformerEngine that had downstream effects.
FP8 Docker Images: We've added new docker images for TransformerEngine and accelerate as well. Use docker pull huggingface/accelerate@gpu-fp8-transformerengine to quickly get an environment going.
torchpippy no more, long live torch.distributed.pipelining
With the latest PyTorch release, torchpippy is now fully integrated into torch core, and as a result we are exclusively supporting the PyTorch implementation from now on.
There are breaking changes to the examples and API that come with this shift. Namely:
Tracing of inputs is done with the shape each GPU will see, rather than the size of the total batch. So for 2 GPUs, one should pass in an input of [1, n, n] rather than [2, n, n] as before.
We no longer support Encoder/Decoder models. PyTorch tracing for pipelining no longer supports encoder/decoder models, so the t5 example has been removed.
Computer vision model support currently does not work: there are some tracing issues regarding ResNets that we are actively looking into.
If any of these changes are too breaking, we recommend pinning your accelerate version. If encoder/decoder model support is actively blocking your inference with pippy, please open an issue and let us know; we can look into restoring the old torchpippy support if needed.
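The tracing change above can be sketched with a small hypothetical helper (the function name and the even-split assumption are illustrative, not part of accelerate's API): the shape to trace with is the total batch divided along dim 0 by the number of GPUs.

```python
def per_gpu_trace_shape(total_batch_shape, num_gpus):
    """Hypothetical helper: the shape one GPU sees is the total batch
    split along dim 0, which is what pipeline tracing now expects."""
    batch, *rest = total_batch_shape
    if batch % num_gpus != 0:
        raise ValueError("batch size must divide evenly across GPUs")
    return [batch // num_gpus, *rest]


# Before this change, you traced with the full batch, e.g. [2, n, n] for
# 2 GPUs; now you trace with what a single GPU sees.
shape = per_gpu_trace_shape([2, 128, 128], num_gpus=2)
```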
Fully Sharded Data Parallelism (FSDP)
Environment Flexibility: Environment variables are now fully optional for FSDP, simplifying configuration. You can now create a FullyShardedDataParallelPlugin yourself, with no need for environment patching:
```python
from accelerate import FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(...)
```
FSDP RAM-efficient loading: Added a utility to enable RAM-efficient model loading (by setting the proper environment variable). This is generally needed if you are not using accelerate launch and need to ensure the environment variables are set up properly for model loading:

```python
from accelerate.utils import enable_fsdp_ram_efficient_loading, disable_fsdp_ram_efficient_loading
```
Added C APIs for language, vision and audio processors including new FeatureExtractor for Whisper model
Support for Phi-3 Small Tokenizer and new OpenAI tiktoken format for fast loading of BPE tokenizers
Added new CUDA custom operators such as MulSigmoid, Transpose2DCast, ReplaceZero, AddSharedInput and MulSharedInput
Enhanced Custom Op Lite API on GPU and fused kernels for DORT
Bug fixes, including a null bos_token for the Qwen2 tokenizer and a SentencePiece-converted FastTokenizer issue with non-ASCII characters, as well as necessary updates for MSVC 19.40 and the numpy 2.0 release
Release v0.20.0: faster encode, better python support
This release is focused on performance and user experience.
Performance:
First off, we did a bit of benchmarking and found some room for improvement!
With a few minor changes (mostly #1587), here is what we get on Llama3 running on a g6 instance on AWS (https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py):
Python API
We shipped better deserialization errors in general, and support for __str__ and __repr__ for all objects. This makes debugging a lot easier; see this:
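A toy sketch (plain Python, not the tokenizers internals) of why defining __repr__ makes objects easier to inspect: without it, printing an object shows only its class and memory address.

```python
class Normalizer:
    """Toy object standing in for a configurable pipeline component.
    Without __repr__, printing it would show only a memory address."""

    def __init__(self, lowercase, strip_accents):
        self.lowercase = lowercase
        self.strip_accents = strip_accents

    def __repr__(self):
        # Expose the configuration so debugging prints are meaningful.
        return (f"Normalizer(lowercase={self.lowercase}, "
                f"strip_accents={self.strip_accents})")


n = Normalizer(lowercase=True, strip_accents=False)
text = repr(n)  # the full configuration, not "<__main__.Normalizer at 0x...>"
```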
Fix: Regression on Processor.save_pretrained caused by #31691 (#32921) Authored by @leloykun
Patch release v4.44.1
Here are the different fixes: mostly Gemma2 context length, nits here and there, and generation issues
is_torchdynamo_compiling -- cast a wide exception net (#32476) by @gante
Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) by @gante and @matthewdouglas
Found 25/28 approved changesets -- score normalized to 8
| Check | Score | Reason |
| --- | --- | --- |
| Maintained | :green_circle: 10 | 30 commit(s) and 12 issue activity found in the last 90 days -- score normalized to 10 |
| CII-Best-Practices | :warning: 2 | badge detected: InProgress |
| License | :green_circle: 9 | license file detected |
| Signed-Releases | :warning: 0 | project has not signed or included provenance with any releases |
| Branch-Protection | :warning: -1 | internal error during GetBranch(4.2.x): Resource not accessible by integration |
| Packaging | :warning: -1 | packaging workflow not detected |
| Token-Permissions | :warning: 0 | detected GitHub workflow tokens with excessive permissions |
| Dangerous-Workflow | :green_circle: 10 | no dangerous workflow patterns detected |
| SAST | :green_circle: 10 | SAST tool is run on all commits |
| Binary-Artifacts | :green_circle: 10 | no binaries found in the repo |
| Security-Policy | :green_circle: 10 | security policy file detected |
| Fuzzing | :warning: 0 | project is not fuzzed |
| Pinned-Dependencies | :warning: 0 | dependency not pinned by hash detected -- score normalized to 0 |
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.
Bumps the pytorch group with 13 updates in the /pytorch directory:
0.33.0 → 0.34.2
2.21.0 → 3.0.0
0.4.2 → 0.4.3
0.11.0 → 0.12.0
1.18.1 → 1.19.2
5.27.3 → 5.28.1
1.5.1 → 1.5.2
0.19.1 → 0.20.0
4.44.0 → 4.44.2
1.26.4 → 2.1.1
4.3.0b0 → 4.3.0b1
3.0 → 3.0.2
0.18.0 → 0.18.0+cpu
Updates
accelerate
from 0.33.0 to 0.34.2
Release notes: sourced from accelerate's releases.
... (truncated)
Commits
- c61f41c Release: v0.34.2
- beb4378 Release: v0.34.1
- e13bef2 Allow DataLoaderAdapter subclasses to be pickled by implementing __reduce__
- ...
- 73a1531 Fix FSDP auto_wrap using characters instead of full str for layers (#3075)
- 159c0dd Release: v0.34.0
- 8931e5e Remove skip_first_batches support for StatefulDataloader and fix all the te...
- a848592 Speed up tests by shaving off subprocess when not needed (#3042)
- 758d624 add set_epoch for MpDeviceLoaderWrapper (#3053)
- b07ad2a Fix typo in comment (#3045)
- 1d09a20 use duck-typing to ensure underlying optimizer supports schedulefree hooks (#...
Updates
datasets
from 2.21.0 to 3.0.0
Release notes: sourced from datasets's releases.
Commits
- 3505ed9 Release: 3.0.0 (#7145)
- ca58154 fix streaming from arrow files (#7083)
- be5cff0 Test get_dataset_config_info with non-existing/gated/private dataset (#7124)
- e4c87a6 Disable implicit token in CI (#7126)
- 880a52c Fix wrong SHA in CI tests of HubDatasetModuleFactoryWithParquetExport (#7125)
- 88f646c Rename LargeList.dtype to LargeList.feature (#7106)
- 3813ce8 Fix typed examples iterable state dict (#7121)
- cedffa5 don't mention the script if trust_remote_code=False (#7120)
- 2878019 Use huggingface_hub cache (#7105)
- 70bac27 Install transformers with numpy-2 CI (#7119)
Updates
evaluate
from 0.4.2 to 0.4.3
Release notes: sourced from evaluate's releases.
Commits
- 5310084 version 0.4.3 (#626)
- 0565509 remove ignore_url_params (#624)
- db16a6e Replace deprecated use_auth_token with token (#621)
- d1a15f6 Fix CI with temporary pin nltk<3.9 (#623)
- 5be95df feat(ci): add trufflehog secrets detection (#600)
Updates
onnxruntime-extensions
from 0.11.0 to 0.12.0
Release notes: sourced from onnxruntime-extensions's releases.
Commits
- cb47d2c Update nuget extraction path for iOS xcframework (#792)
- b27fbbe Update macosx framework packaging to follow apple guidelines (#776) (#789)
- c7a2d45 Update build-package-for-windows.yml (#784)
- 3ce1e9f Upgrade ESRP signing task from v2 to v5 (#780)
- e113ed3 removed OpenAIAudioToText from config (#777)
- c9c11b4 Fix the windows API missing issue and Linux shared library size issue for Jav...
- c3145b8 add the decoder_prompt_id for whisper tokenizer (#775)
- 620050f reimplement resize cpu kernel for image processing (#768)
- d79299e increase timeout (#773)
- 735041e increase timeout (#772)
Updates
onnxruntime
from 1.18.1 to 1.19.2
Release notes: sourced from onnxruntime's releases.
... (truncated)
Commits
- ffceed9 ORT 1.19.2 Release: Cherry Pick Round 1 (#21861)
- d651463 ORT 1.19.1 Release: Cherry Pick Round 1 (#21796)
- 26250ae ORT 1.19.0 Release: Cherry-Pick Round 2 (#21726)
- ccf6a28 ORT 1.19.0 Release: Cherry-Pick Round 1 (#21619)
- ee2fe87 ORT 1.19.0 Release: Cherry-Pick Round 0 (#21609)
- 530a2d7 Enable FP16 Clip and Handle Bias in FP16 Depthwise Conv (#21493)
- 82036b0 Remove references to the outdated CUDA EP factory method (#21549)
- 07d3be5 CoreML: Add ML Program Split Op (#21456)
- 5d78b9a [TensorRT EP] Update TRT OSS Parser to 10.2 (#21552)
- 8417c32 Keep QDQ nodes w/ nonpositive scale around MaxPool (#21182)
Updates
protobuf
from 5.27.3 to 5.28.1
Commits
- 10ef3f7 Updating version.json and repo version numbers to: 28.1
- d70f077 Merge pull request #18191 from protocolbuffers/cp-ruby-upb
- 60e585c Update staleness
- 70b77de Fix a potential Ruby-upb use of uninitialized memory.
- 5b4b3af Merge pull request #18188 from acozzette/28-fix
- 8ea3bb1 Fix compiler error with StrongReferenceToType()
- 9deedf0 upb: fix uninitialized upb_MessageValue buffer bugs (#18160)
- 3454ed8 Merge pull request #18013 from protocolbuffers/28.x-202408281753
- 976ab41 Updating version.json and repo version numbers to: 28.1-dev
- 439c42c Updating version.json and repo version numbers to: 28.0
Updates
scikit-learn
from 1.5.1 to 1.5.2
Release notes: sourced from scikit-learn's releases.
Commits
- 156ef14 [cd build] trigger ci/cd
- 40c7416 DOC update the list of contributors for 1.5.2 (#29819)
- c119c7e DOC add orphan option to developers/index.rst
- 4d838dc TST fix tolerance as in #29400
- 2e79f52 DOC fix entry in changelog for backport happening in 1.5.2 (#29815)
- c735641 MAINT install setuptools for debian-32bits
- c993dd2 DOC update repr for NumPy 2.0
- 8ade4f5 MAINT bump from 1.5.1 to 1.5.2
- 04b71d2 FIX solve conflict git
- b5b5017 MAINT update lock file
Updates
tokenizers
from 0.19.1 to 0.20.0
Release notes: sourced from tokenizers's releases.
... (truncated)
Commits
- a5adaac version 0.20.0
- a8def07 Merge branch 'fix_release' of github.com:huggingface/tokenizers into branch_v...
- fe50673 Fix CI
- b253835 push cargo
- fc3bb76 update dependencies
- bfd9cde Perf improvement 16% by removing offsets. (#1587)
- bd27fa5 add deserialize for pre tokenizers (#1603)
- 56c9c70 Tests + Deserialization improvement for normalizers. (#1604)
- 49dafd7 Fix strip python type (#1602)
- bded212 Support None to reset pre_tokenizers and normalizers, and index sequences (...
Updates
transformers
from 4.44.0 to 4.44.2
Release notes: sourced from transformers's releases.
Commits
- 1748902 v4.44.2
- 6845144 Fix regression on Processor.save_pretrained caused by #31691 (#32921)
- 3d8cba8 fix: no need to dtype A in jamba (#32924)
- c1df7f8 fix: jamba cache fails to use torch.nn.module (#32894)
- ca56cd7 v4.44.1
- 6e931e1 Gemma2: fix FA2 generation (#32553)
- 74f57df Fix generate with inputs_embeds as input (#32493)
- 084fe2e Merge branch 'v4.44-release' of github.com:huggingface/transformers into v4.4...
- fff9be1 Reduce the error log when using core models that need their weights renamed, ...
- 4fd0f48 Fix VLM generation issues (#32836)
Updates
numpy
from 1.26.4 to 2.1.1
Release notes: sourced from numpy's releases.
... (truncated)
Commits
- 48606ab Merge pull request #27328 from charris/prepare-2.1.1
- a7cb4c4 REL: Prepare for the NumPy 2.1.1 release [wheel build]
- 884c92b Merge pull request #27303 from charris/backport-27284
- ca7f5c1 Merge pull request #27304 from charris/backport-27049
- 2a49507 BUG: f2py: better handle filtering of public/private subroutines
- d4306dd TST: Add regression test for gh-26920
- db9668d BLD: cp311- macosx_arm64 wheels [wheel build]
- c6ff254 Merge pull request #27287 from charris/post-2.0.2-release-update
- 326bc17 MAINT: Update main after the 2.0.2 release
- 8164b7c Merge pull request #27278 from charris/backport-27275
Updates
jupyterlab
from 4.3.0b0 to 4.3.0b1
Release notes: sourced from jupyterlab's releases.