mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Checkpointing DLRMv2 #639

Open mailvijayasingh opened 1 year ago

mailvijayasingh commented 1 year ago

Copying the issue link that I had posted on dlrm repo: https://github.com/facebookresearch/dlrm/issues/346#issue-1687587014

The content goes here: I tried to use torchsnapshot to save checkpoints of the model in the torchrec implementation. I made the following changes in dlrm_main.py for this purpose.

for batched_iterator in batched(iterator, n):
    for it in itertools.count(start_it):
        try:
            if is_rank_zero and print_lr:
                for i, g in enumerate(pipeline._optimizer.param_groups):
                    print(f"lr: {it} {i} {g['lr']:.6f}")
            pipeline.progress(batched_iterator)
            lr_scheduler.step()
            if is_rank_zero:
                pbar.update(1)
            # Added: take a snapshot after every training step, on every rank.
            snapshot = torchsnapshot.Snapshot.take(
                path="embedding_shards",
                app_state=app_state,
                replicated=["**"],
            )
        except StopIteration:
            if is_rank_zero:
                print("Total number of iterations:", it)
            start_it = it
            break

I did get some weights saved in the embedding_shards directory; however, I am not sure how to interpret its contents.

In the embedding_shards directory, I see two subdirectories: batched and sharded. batched has 8 files (the names are UUIDs) with a total size of 196 GB.

sharded has following files with a total size of 98 GB: model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_29194304_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_11.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_1048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_3145728_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_16048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_37097152_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_21048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_7097152_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_22.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_26048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_9194304_0

erichan1 commented 1 year ago

cc @janekl @samiwilf any thoughts here on the checkpointing issue? I can tag torchrec folks if this continues to be a blocker.

yifuwang commented 1 year ago

Is there a way to gather all the weights on CPU and then dump them, or to extract the embedding tables and sharded layer weights, gather those, and then dump the tables on the host?

Can you check the "Snapshot Content Access" section in https://pytorch.org/torchsnapshot/main/getting_started.html to see if the example meets your need? For example:

t_cat_9_weight = snapshot.read_object(path="0/model/sparse_arch/embedding_bag_collection/embedding_bags/t_cat_9")

What is the way to load these weights on a node with a different number of GPUs, or even on CPU?

For TorchRec-based DLRM models, you can simply call Snapshot.restore() on a snapshot that was created with a different world size.

For more info, see:

While using snapshot, should I call take only on rank 0, or is using it as shown in the code snippet above sufficient?

Snapshot.take() is a collective function that needs to be called simultaneously on all ranks.
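
Because the call is collective, a periodic checkpoint has to be triggered by a condition that every rank evaluates identically, not by a rank-zero guard. The following is a minimal sketch under stated assumptions: `maybe_take_snapshot`, `interval`, and `base_path` are hypothetical names that are not part of dlrm_main.py, and `take_fn` stands in for `torchsnapshot.Snapshot.take` so that the sketch runs without a distributed setup.

```python
from typing import Any, Callable, Dict, Optional


def maybe_take_snapshot(
    it: int,
    interval: int,
    app_state: Dict[str, Any],
    take_fn: Callable[..., Any],
    base_path: str = "embedding_shards",
) -> Optional[Any]:
    """Call take_fn every `interval` iterations; return None otherwise.

    Hypothetical helper. The decision depends only on `it` and `interval`,
    which are the same on every rank, so all ranks enter the collective
    call together. Do not wrap this in an `is_rank_zero` check.
    """
    if interval <= 0 or (it + 1) % interval != 0:
        return None
    # One directory per checkpoint, so a later snapshot does not
    # overwrite an earlier one.
    return take_fn(path=f"{base_path}/it_{it + 1}", app_state=app_state)


# Stand-in for torchsnapshot.Snapshot.take that only records the path.
calls = []


def fake_take(path, app_state):
    calls.append(path)
    return path


for it in range(6):
    maybe_take_snapshot(it, interval=3, app_state={}, take_fn=fake_take)

print(calls)  # ['embedding_shards/it_3', 'embedding_shards/it_6']
```

In the training loop, every rank would call this helper with `torchsnapshot.Snapshot.take` as `take_fn` in place of the unconditional call.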

janekl commented 1 year ago

Regarding

I am not sure how to interpret the saved directory and how to read all the embedding tables from the output shown above.

There are 26 categories / embedding tables in the model, with row counts given by the --num_embeddings_per_feature parameter (see the README).

From that, we can see that the largest tables are those for cat_0, cat_9, cat_19, cat_20 and cat_21. These are also the categories you see in the "sharded" directory you mentioned. I believe the numeric suffixes in the file names are start indices (offsets) within a given category's table, used when a table is split into multiple files while taking a snapshot, since they never exceed the corresponding --num_embeddings_per_feature value.

When running https://github.com/mlcommons/training/blob/master/recommendation_v2/torchrec_dlrm/dlrm_main.py you could also add the --print_sharding_plan flag to see how tables are distributed across ranks. This could also help you interpret the saved files.
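
To check that reading against a directory listing, the file names can be grouped by table and their suffixes sorted numerically. This is only a sketch: `group_shard_offsets` is a hypothetical helper, and treating the two trailing numbers as a row offset and a column offset is an assumption based on the observation above, not documented torchsnapshot behavior. The sample names are taken from the listing in the issue.

```python
import re
from collections import defaultdict

# Assumed layout: <parameter name>_<row offset>_<column offset>
SHARD_NAME = re.compile(r"^(?P<param>.+)_(?P<row>\d+)_(?P<col>\d+)$")


def group_shard_offsets(file_names):
    """Map each parameter name to its sorted list of (row, col) offsets."""
    offsets = defaultdict(list)
    for name in file_names:
        match = SHARD_NAME.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed pattern
        offsets[match.group("param")].append(
            (int(match.group("row")), int(match.group("col")))
        )
    return {param: sorted(pairs) for param, pairs in offsets.items()}


prefix = "model.sparse_arch.embedding_bag_collection.embedding_bags."
names = [
    prefix + "t_cat_10.weight_2097152_0",
    prefix + "t_cat_10.weight_0_0",
    prefix + "t_cat_10.weight_1048576_0",
    prefix + "t_cat_11.weight_0_0",
]

grouped = group_shard_offsets(names)
for param, pairs in grouped.items():
    print(param, [row for row, _ in pairs])
```

In practice, `names` would come from `os.listdir()` on the sharded directory; the largest row offset of each table can then be compared with the matching --num_embeddings_per_feature entry.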