pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
https://pytorch.org/examples
BSD 3-Clause "New" or "Revised" License

[doc][c10d] fixup fsdp tutorial #1297

Closed. c-p-i-o closed this pull request 2 weeks ago.

c-p-i-o commented 3 weeks ago

Summary: Fix up the FSDP tutorial to get it functional again.

  1. Add missing import for load_dataset.
  2. Use checkpoint instead of _shard.checkpoint to get rid of a warning.
  3. Add nlp to requirements.txt.
  4. Remove load_metric, since this function no longer exists in the new datasets module.
  5. Add legacy=False to get rid of tokenizer warnings.
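
Items 1, 2, and 5 above can be sketched roughly as follows. This is an illustration rather than the tutorial's actual code: the helper names (load_splits, get_checkpoint_module, build_tokenizer) are hypothetical, and the third-party imports are deferred into the functions so that the snippet loads even where torch, datasets, or transformers is not installed.

```python
# Illustrative sketch only; helper names are hypothetical and this is not
# the tutorial's actual code.


def load_splits(dataset_name, **kwargs):
    """Item 1: load_dataset has to be imported from the `datasets` package."""
    from datasets import load_dataset

    return load_dataset(dataset_name, **kwargs)


def get_checkpoint_module():
    """Item 2: use the public module, not torch.distributed._shard.checkpoint."""
    import torch.distributed.checkpoint as dist_cp

    return dist_cp


def build_tokenizer(model_name):
    """Item 5: pass legacy=False when loading the T5 tokenizer."""
    from transformers import T5Tokenizer

    return T5Tokenizer.from_pretrained(model_name, legacy=False)
```

Which dataset and model names are passed in is left to the caller; none are assumed here.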

Test Plan: Ran the tutorial as follows and ensured that it ran successfully:

torchrun --nnodes=1 --nproc_per_node=2 T5_training.py
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
*****************************************
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793] Setting
OMP_NUM_THREADS environment variable for each process to be 1 in
default, to avoid your system being overloaded, please further tune the
variable for optimal performance in your application as needed.
W1031 09:46:49.166000 2847649 torch/distributed/run.py:793]
*****************************************
dict_keys(['train', 'validation', 'test'])
Size of train dataset:  (157252, 3)
Size of Validation dataset:  (5599, 3)
dict_keys(['train', 'validation', 'test'])
Size of train dataset:  (157252, 3)
Size of Validation dataset:  (5599, 3)
bFloat16 enabled for mixed precision - using bfSixteen policy
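
The dataset sizes appear twice in the log, which is consistent with each of the two worker processes printing them independently. A minimal sketch of how a script launched this way can tell its workers apart, assuming only the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torchrun sets for each process; DistEnv and read_dist_env are hypothetical names, not part of the tutorial.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node
    world_size: int  # total number of processes


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the per-process variables exported by torchrun.

    Falls back to single-process values when they are absent, for example
    when the script is started with plain `python` (an assumption made
    here for convenience).
    """
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    dist_env = read_dist_env()
    print(
        f"rank {dist_env.rank} of {dist_env.world_size} "
        f"(local rank {dist_env.local_rank})"
    )
```

Guarding a print with a check such as rank == 0 is one common way to log a message once rather than once per process.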
netlify[bot] commented 3 weeks ago

Deploy Preview for pytorch-examples-preview canceled.

Name               Link
Latest commit      cb002880d9be633528c6945bef767b738e3a52e6
Latest deploy log  https://app.netlify.com/sites/pytorch-examples-preview/deploys/672a8d974a588a00083764e1

fduwjj commented 3 weeks ago

looks like running python example failed?

c-p-i-o commented 3 weeks ago

> looks like running python example failed?

Unrelated to my change, but I fixed it anyway: CI needed to be updated to a newer Python version.

See the additional diff I made to .github/workflows/main_python.yml. Let me know if you want me to split this out into a separate change.

c-p-i-o commented 3 weeks ago

@fduwjj: CI is green now. As mentioned, CI broke because of upstream dependency changes, and I had to do three things to fix Run Python Examples:

  1. Use a newer Python version in .github/workflows.
  2. Pin numpy to below version 2.
  3. Pin torchvision.

c-p-i-o commented 2 weeks ago

This change will be rebased on https://github.com/pytorch/examples/pull/1299 to fix the failing Python Examples.
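
For the NumPy pin in the CI fix above, a small sanity check along these lines could confirm that an environment actually resolved to a release below version 2. It is a sketch that uses only the standard library; major_version and numpy_is_below_2 are hypothetical helpers, not part of the repository or its CI.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading numeric component of a version string like '1.26.4'."""
    return int(version.split(".")[0])


def numpy_is_below_2() -> Optional[bool]:
    """Return True/False for the installed NumPy, or None if it is not installed."""
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None
    return major_version(installed) < 2


if __name__ == "__main__":
    print("numpy < 2:", numpy_is_below_2())
```

The parsing is deliberately simple and assumes a plain dotted version string with no epoch prefix.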