Open Chen-Cai-OSU opened 2 years ago
`device` refers to the device your model is on, while `emb_device` refers to the device where the historical embeddings are stored. In general, `device=cuda` and `emb_device=cpu`. Note that `device` will be set automatically in case you call `model.to(device)`.
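To make the split concrete, here is a minimal sketch of the idea. `TinyHistory` is a hypothetical stand-in, not the real `History` class from `torch_geometric_autoscale` (which has more machinery): embeddings live on `emb_device` (usually CPU, optionally pinned for faster host-to-device copies), while `pull`/`push` move slices to and from the compute `device`.

```python
import torch

class TinyHistory(torch.nn.Module):
    """Hypothetical sketch of a History-style buffer: storage stays on
    emb_device, while pull/push move slices to/from the compute device."""
    def __init__(self, num_embeddings, embedding_dim,
                 emb_device='cpu', device='cpu'):
        super().__init__()
        self.emb_device = torch.device(emb_device)
        self.device = torch.device(device)
        # Pinned memory speeds up async host-to-device copies, but only
        # makes sense for CPU storage when CUDA is actually available.
        pin = self.emb_device.type == 'cpu' and torch.cuda.is_available()
        self.emb = torch.zeros(num_embeddings, embedding_dim,
                               device=self.emb_device, pin_memory=pin)

    def push(self, x, idx):
        # Write fresh embeddings back into the (CPU) storage.
        self.emb[idx] = x.detach().to(self.emb_device)

    def pull(self, idx):
        # Fetch stored embeddings onto the compute device.
        return self.emb[idx].to(self.device)

hist = TinyHistory(10, 4)  # both devices default to CPU in this sketch
hist.push(torch.ones(3, 4), torch.tensor([0, 1, 2]))
out = hist.pull(torch.tensor([0, 5]))  # row 0 is ones, row 5 still zeros
print(out.sum())
```

With `device='cuda:3'` and `emb_device='cpu'`, only the small pulled slices ever occupy GPU memory, which is the point of the design.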
Thank you for the explanation. What I don't understand is that when I run the following code,
```python
model = GCN(10, 10, 10, 10, 2, device='cpu').to('cuda:3')
print(model)
```
I got
```
GCN(
  (histories): ModuleList(
    (0): History(10, 10, emb_device=cpu, device=cuda:3)
  )
  (lins): ModuleList()
  (convs): ModuleList(
    (0): GCNConv(10, 10)
    (1): GCNConv(10, 10)
  )
  (bns): ModuleList(
    (0): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
```
I observe a process on both cuda:0 and cuda:3 (I expect only cuda:3 to be used). Does that mean the emb_device is somehow not CPU? I also printed out `self.emb = torch.empty(num_embeddings, embedding_dim, device=device, pin_memory=pin_memory)` when the History class is initialized, and the device is indeed the CPU. I just don't know why cuda:0 is used.
I am using torch 1.10.0 + cuda 11.3 + pyg 2.0.4 + python 3.7.13. Let me know if you need more info. Thank you!
Yes, this looks correct to me. Histories will be on CPU while model parameters are on cuda:3. If there is a process running on cuda:0, that is definitely a bug I can try to look into. Any pointers are highly appreciated.
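One plausible lead (an assumption about the cause, not verified against this repo): allocating pinned memory (`pin_memory=True` in History) forces CUDA context initialization on the *current* CUDA device, which defaults to cuda:0 — that alone is enough to show a phantom process there. Restricting GPU visibility before any CUDA work sidesteps it:

```python
import os

# Must be set before torch initializes CUDA in this process.
# With visibility restricted, the intended GPU is the only one,
# so pinned-memory allocation cannot touch the physical cuda:0.
os.environ['CUDA_VISIBLE_DEVICES'] = '3'

import torch  # noqa: E402  (imported after the env var is set)

# Inside this process the single visible GPU is re-indexed as cuda:0.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
```

Alternatively, calling `torch.cuda.set_device(3)` before constructing the model should steer the context to cuda:3, but the environment-variable route is the more robust of the two.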
I don't know what the possible reasons are. I also tried pyg 2.0.4 + torch 1.7.1 + cuda 11.0 and got the same behavior. To reproduce it, just add the following lines to models/gcn.py:

```python
if __name__ == '__main__':
    model = GCN(10, 10, 10, 10, 5, device='cpu')
    print(model)
```

and run `python -m torch_geometric_autoscale.models.gcn`.
Hello Matthias, thank you very much for the code. Nice work as always. I was wondering: what is the difference between `emb_device` and `device` for the History class?
When I initialize a GCN with `device='cpu'` and then move it to `cuda:3` (as in the snippet and printed output above), I notice there is a process on cuda:0 (I have multiple GPUs), which I don't understand. Is this desirable behavior? Also, in general, should I always set the device in the GCN class to None? I noticed this is what you did in large_benchmark/main.py.