Closed zw615 closed 5 years ago
Hi,
Usually it is the gradient computation graph that allocates most of the memory. The `torch.distributed` code in this repo actually also helps reduce the memory footprint if you are training on multiple networks at each iteration. You can also look into various memory-saving techniques (such as gradient checkpointing), but they are not currently provided in this repo.
-Tongzhou
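For reference, gradient checkpointing in PyTorch looks roughly like the sketch below. This is a generic `torch.utils.checkpoint` example with a made-up toy model, not code from this repo:

```python
# Generic gradient-checkpointing sketch (toy model for illustration, not part of this repo).
# The checkpointed segment does not store its intermediate activations during the forward
# pass; they are recomputed during backward, trading extra compute for lower memory use.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

x = torch.randn(64, 1024, requires_grad=True)

# Treat the first four layers as one checkpointed segment; only its input and
# output are kept in memory between forward and backward.
h = checkpoint(model[:4], x)
out = model[4:](h)
out.sum().backward()
```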
So the `torch.distributed` code in this repo is used for dataset distillation on multiple networks. Could you please give an example of how to run distributed training? I don't see one in the README.
Also, I think splitting the data across multiple GPUs and computing gradients on each GPU would be more convenient. But that would mean building a computation graph across multiple devices and backpropagating through a subgraph on each device. I wonder whether that is feasible? Any advice?
The usage of distributed training is documented in the advanced usage page.
Maybe it is. I have not investigated that approach.
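For what it's worth, the generic PyTorch pattern for that idea, one process per GPU with each process computing gradients on its own data shard and then averaging them, looks roughly like the sketch below (toy model and assumed launcher-provided environment; this is not how distillation is implemented in this repo):

```python
# Generic data-parallel gradient-averaging sketch (assumptions: one process per GPU,
# launched so that MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set in the
# environment, e.g. by torch.distributed.launch or torchrun; toy model and data).
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
device = torch.device("cuda", rank % torch.cuda.device_count())

model = nn.Linear(1024, 10).to(device)              # toy stand-in for a real network
data = torch.randn(32, 1024, device=device)         # each process holds its own data shard
target = torch.randint(0, 10, (32,), device=device)

loss = F.cross_entropy(model(data), target)
loss.backward()

# Average gradients across processes so every replica applies the same update.
world_size = dist.get_world_size()
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size
```

Such a script would typically be launched with something like `python -m torch.distributed.launch --nproc_per_node=4 your_script.py` (or `torchrun` in newer PyTorch versions), which is what sets the rank and world-size environment variables.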
I’m going to close this issue for now. Feel free to open a new one if you have other questions!
Hi! I wonder whether there is any way to distill more data than fits within the GPU memory limit? For a large-scale dataset, or a typical 11 GB / 12 GB GPU, that could be really useful. At first, I thought `state.distributed` in your code was intended for that by putting the distilled data on multiple GPUs, but then I found out I was wrong. It seems that this code only distills as much data as fits in a single GPU's memory. So, any advice on this matter? Thanks a lot!