jasonz5 / STGormer

This repository contains the code implementation of our Spatio-Temporal Graph Transformer (STGormer).

Tensors allocated on different devices #1

Open npclu0609 opened 2 weeks ago

npclu0609 commented 2 weeks ago

Hello, when I run the code I get the error `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!` Do you know where the problem might be? Full log and traceback:

```
2024-09-18 13:51:50: Experiment log path in: /Project/Experiment-Project/STGormer/experiments/NYCBike1/20240918-135150
2024-09-18 13:51:50: Experiment configs are: Namespace(seed=31, device='cuda', mode='train', best_path=None, debug=False, data_dir='data', dataset='NYCBike1', input_length=19, output_length=1, batch_size=32, test_batch_size=32, graph_file='data/NYCBike1/adj_mx.npz', num_nodes=128, num_timestamps=168, tod_scaler=1, steps_per_day=24, layers=['S', 'T'], layer_depth=3, pos_embed_T='timepos', cen_embed_S=True, attn_bias_S=True, attn_mask_S=False, attn_mask_T=False, moe_status='SoftMoE', moe_mlr=False, num_experts=6, moe_dropout=0.1, top_k=1, moe_add_ff=False, moe_position='Full', expertWeightsAda=False, expertWeights=[0.8, 0.2], d_input=4, d_output=2, d_model=64, d_time_embed=24, d_space_embed=24, num_heads=4, mlp_ratio=4, dropout=0.1, yita=0.5, fft_status=False, epochs=200, lr_init=0.001, scheduler='StepLR', step_size=25, milestones=[1, 60, 90, 120, 150], factor=0.8, patience=10, gamma=0.5, mask_value_train=5.0, mask_value_test=5.0, early_stop=True, early_stop_patience=30, grad_norm=True, max_grad_norm=5, use_dwa=False, temp=4, save_path=None, num_shortpath=16, num_node_deg=9, log_dir='/Project/Experiment-Project/STGormer/experiments/NYCBike1/20240918-135150')
2024-09-18 13:51:50: Traceback (most recent call last):
  File "/Project/Experiment-Project/STGormer/main.py", line 87, in model_supervisor
    results = trainer.train()  # best_eval_loss, best_epoch
  File "/Project/Experiment-Project/STGormer/model/trainer.py", line 107, in train
    train_epoch_loss = self.train_epoch(epoch)
  File "/Project/Experiment-Project/STGormer/model/trainer.py", line 57, in train_epoch
    repr, aux_loss = self.model(data, self.graph)  # [B,N,C]
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Project/Experiment-Project/STGormer/model/models.py", line 48, in forward
    repr, aux_loss = self.encoder(view, graph)  # [B, N, T, D]
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Project/Experiment-Project/STGormer/model/layers.py", line 75, in forward
    encoder_input, _ = self.positional_encoding_1d(encoder_input)  # BN, T, D
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Project/Experiment-Project/STGormer/model/positional_encoding.py", line 13, in forward
    pos_enc = tp_enc_1d(input_data)
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/positional_encodings/torch_encodings.py", line 41, in forward
    sin_inp_x = torch.einsum("i,j->ij", pos_x, self.inv_freq)
  File "/.conda/envs/stgormer/lib/python3.12/site-packages/torch/functional.py", line 386, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```

jasonz5 commented 1 week ago

Hi, sorry for the somewhat late reply. The current version of the code runs without errors in my local testing, so the device mismatch is probably due to differences in device configuration between environments. My usual way of fixing this kind of problem: use the traceback to locate the exact failing line, set a breakpoint there (`import ipdb; ipdb.set_trace()`), check where the variable or model lives (`tensor.device`; for a module, `next(model.parameters()).device`), and then move things onto the same device (`tensor1.to(tensor2.device)`). Hope this helps.
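A self-contained sketch of that workflow (the tensors here are placeholders, not variables from the repository):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(4, 8, device=device)  # e.g. an activation that followed the model
b = torch.randn(8)                    # e.g. a tensor created on the CPU by default

# To pause right before the failing line from a traceback, insert a breakpoint there:
# import ipdb; ipdb.set_trace()

# Check where each operand lives; `.device` is an attribute, not a method.
print(a.device, b.device)

# Move one operand onto the other's device before combining them.
b = b.to(a.device)
out = a @ b  # (4, 8) @ (8,) -> (4,), no device-mismatch error
```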

npclu0609 commented 1 week ago

Got it, I'll try it the way you described. Thank you very much!
