TikaToka opened this issue 1 year ago (status: Open)
Hi, @Sy-Zhang and team. First of all, thank you for sharing your work!
I have a question about using your code.
I am trying to use your model with my own visual and text features (extracted with CLIP).
For the Charades dataset, it worked well.
However, for TACoS, the CUDA error below occurs:
```
Traceback (most recent call last):
  File "moment_localization/train.py", line 319, in <module>
    scheduler=scheduler)
  File "/home/jckim/2D-TAN/moment_localization/../lib/core/engine.py", line 42, in train
    state['optimizer'].step(closure)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/optim/adam.py", line 92, in step
    loss = closure()
  File "/home/jckim/2D-TAN/moment_localization/../lib/core/engine.py", line 31, in closure
    loss, output = state['network'](state['sample'])
  File "moment_localization/train.py", line 151, in network
    prediction, map_mask = model(textual_input, textual_mask, visual_input)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jckim/2D-TAN/moment_localization/../lib/models/tan.py", line 22, in forward
    fused_h = self.fusion_layer(textual_input, textual_mask, map_h, map_mask)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jckim/2D-TAN/moment_localization/../lib/models/fusion_modules/base_fusion.py", line 22, in forward
    txt_h = self.tex_linear(txt_h)[:,:,None,None]
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
And when I pass `CUDA_LAUNCH_BLOCKING=1`, I get:
File "moment_localization/train.py", line 319, in <module> scheduler=scheduler) File "/home/jckim/2D-TAN/moment_localization/../lib/core/engine.py", line 42, in train state['optimizer'].step(closure) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper return func(*args, **kwargs) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context return func(*args, **kwargs) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/optim/adam.py", line 92, in step loss = closure() File "/home/jckim/2D-TAN/moment_localization/../lib/core/engine.py", line 31, in closure loss, output = state['network'](state['sample']) File "moment_localization/train.py", line 151, in network prediction, map_mask = model(textual_input, textual_mask, visual_input) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/jckim/2D-TAN/moment_localization/../lib/models/tan.py", line 22, in forward fused_h = self.fusion_layer(textual_input, textual_mask, map_h, map_mask) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/jckim/2D-TAN/moment_localization/../lib/models/fusion_modules/base_fusion.py", line 23, in forward map_h = self.vis_conv(map_h) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 446, in forward return self._conv_forward(input, self.weight, self.bias) File "/home/jckim/mambaforge/envs/HLTI/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward self.padding, self.dilation, self.groups) RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue. import torch torch.backends.cuda.matmul.allow_tf32 = True torch.backends.cudnn.benchmark = True torch.backends.cudnn.deterministic = False torch.backends.cudnn.allow_tf32 = True data = torch.randn([32, 512, 128, 128], dtype=torch.float, device='cuda', requires_grad=True) net = torch.nn.Conv2d(512, 512, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1) net = net.cuda().float() out = net(data) out.backward(torch.randn_like(out)) torch.cuda.synchronize() ConvolutionParams data_type = CUDNN_DATA_FLOAT padding = [0, 0, 0] stride = [1, 1, 0] dilation = [1, 1, 0] groups = 1 deterministic = false allow_tf32 = true input: TensorDescriptor 0x5610fdbf3970 type = CUDNN_DATA_FLOAT nbDims = 4 dimA = 32, 512, 128, 128, strideA = 8388608, 16384, 128, 1, output: TensorDescriptor 0x5610fdaa24f0 type = CUDNN_DATA_FLOAT nbDims = 4 dimA = 32, 512, 128, 128, strideA = 8388608, 16384, 128, 1, weight: FilterDescriptor 0x5610fdbe9210 type = CUDNN_DATA_FLOAT tensor_format = CUDNN_TENSOR_NCHW nbDims = 4 dimA = 512, 512, 1, 1, Pointer addresses: input: 0x7fac74000000 output: 0x7face6000000 weight: 0x7fad8db00000
I changed the config file as below:
```yaml
WORKERS: 16
MODEL_DIR: ./models/conv
RESULT_DIR: ./results/conv
LOG_DIR: ./log
DATA_DIR: ./data/TACoS
FEATURE_DIR: {directory to my visual_features}  # <- custom added; worked well on Charades
DATASET:
  NAME: TACoS
  VIS_INPUT_TYPE: clip
  NO_VAL: True
  NUM_SAMPLE_CLIPS: 256
  TARGET_STRIDE: 2
  NORMALIZE: True
  RANDOM_SAMPLING: False
TEST:
  BATCH_SIZE: 32
  RECALL: 1,5
  TIOU: 0.1,0.3,0.5,0.7
  EVAL_TRAIN: False
  NMS_THRESH: 0.5
CUDNN:
  DETERMINISTIC: False
  BENCHMARK: True
TRAIN:
  BATCH_SIZE: 32
  LR: 0.0001
  WEIGHT_DECAY: 0.0000
  MAX_EPOCH: 100
  CONTINUE: False
LOSS:
  NAME: bce_rescale_loss
  PARAMS:
    MIN_IOU: 0.3
    MAX_IOU: 0.7
    BIAS: 0.0
TAN:
  FRAME_MODULE:
    NAME: FrameAvgPool
    PARAMS:
      INPUT_SIZE: 512  # <<< changed
      HIDDEN_SIZE: 512
      KERNEL_SIZE: 2
      STRIDE: 2
  PROP_MODULE:
    NAME: SparsePropConv
    PARAMS:
      HIDDEN_SIZE: 512
      NUM_SCALE_LAYERS: [16, 8, 8, 8]
  FUSION_MODULE:
    NAME: BaseFusion
    PARAMS:
      HIDDEN_SIZE: 512
      TXT_INPUT_SIZE: 512  # <<< changed
      TXT_HIDDEN_SIZE: 512
      LSTM:
        NUM_LAYERS: 3
        BIDIRECTIONAL: False
  MAP_MODULE:
    NAME: MapConv
    PARAMS:
      INPUT_SIZE: 512
      HIDDEN_SIZES: [512, 512, 512, 512, 512, 512, 512, 512]
      KERNEL_SIZES: [5, 5, 5, 5, 5, 5, 5, 5]
      STRIDES: [1, 1, 1, 1, 1, 1, 1, 1]
      PADDINGS: [16, 0, 0, 0, 0, 0, 0, 0]
      DILATIONS: [1, 1, 1, 1, 1, 1, 1, 1]
  PRED_INPUT_SIZE: 512
MODEL:
  NAME: TAN
  CHECKPOINT: ./checkpoints/TACoS/iter016165-0.4644-0.7443.pkl
```
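For scale: with `NUM_SAMPLE_CLIPS: 256` and `TARGET_STRIDE: 2` the 2D proposal map is 128x128, which matches the `[32, 512, 128, 128]` tensor in the cuDNN repro above. A back-of-envelope estimate (an assumption-based sketch, not a measurement) of one fp32 activation map:

```python
# Rough size of a single fp32 feature map at batch size 32.
batch, channels = 32, 512
clips = 256 // 2  # NUM_SAMPLE_CLIPS / TARGET_STRIDE = 128
elems = batch * channels * clips * clips
print(f"{elems * 4 / 2**30:.1f} GiB")  # -> 1.0 GiB per map
```

So each 512-channel map is about 1 GiB before counting the eight MapConv layers.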
This kind of config change also worked well on Charades.
For loading the features, I am using code like this in
`./lib/dataset/tacos.py`:
```python
def get_word_embedding(self, sentence):
    inputs = self.clip_tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        features = self.clip_model(**inputs)
    last_hidden_state_feature = features.last_hidden_state.squeeze()
    return last_hidden_state_feature

def get_video_features(self, vid):
    feature_path = os.path.join(self.feature_dir, vid + '.npz')
    features = torch.Tensor(np.load(feature_path)['features'][:]).float()
    if config.DATASET.NORMALIZE:
        features = F.normalize(features, dim=1)
    vis_mask = torch.ones((features.shape[0], 1))
    return features, vis_mask
```
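For context, the `.npz` files read above are produced roughly like this (a hypothetical sketch of my extraction step; the helper name is made up, and it assumes per-clip CLIP image embeddings of size 512):

```python
import os
import numpy as np

# Hypothetical extraction step (not in the repo): stack per-clip CLIP image
# embeddings into a (num_clips, 512) float32 array and save it under the
# 'features' key that get_video_features() reads back.
def save_video_features(vid, clip_embeddings, feature_dir):
    clip_embeddings = np.asarray(clip_embeddings, dtype=np.float32)
    np.savez(os.path.join(feature_dir, vid + ".npz"), features=clip_embeddings)
```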
This code also worked well when applied to Charades.
The repro snippet given in the error message does not reproduce the problem on my machine.
Do you have any idea what might be causing this?
Since Charades works well, I don't think it is a GPU or system problem.
Thank you in advance!
I haven't met this problem before. Could you try reducing the batch size to see whether it is related to GPU memory? If not, could you try a different number of GPUs to see whether it is related to the GPU count?
Thank you for the quick response!
I tried reducing the batch size to 4, but that also didn't work. (I am using 8 Quadro RTX 8000 GPUs, so I don't think memory is causing the issue.)
I also tried allocating different numbers of GPUs (1, 2, 4, 8); all returned the same error.
And assigning different GPU sets (`CUDA_VISIBLE_DEVICES=0,1,2,3` vs. `CUDA_VISIBLE_DEVICES=4,5,6,7`) also didn't solve my problem :(
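As a next step, I will try pushing one batch through the model on CPU, since illegal memory accesses caused by out-of-range indices usually surface there as a readable Python exception (a sketch reusing the variable names from `moment_localization/train.py` line 151; `model` is wrapped in `DataParallel` per the traceback):

```python
# Hedged debugging sketch: run one TACoS batch on CPU so a bad index
# surfaces as a Python IndexError instead of an asynchronous CUDA error.
model_cpu = model.module.cpu() if hasattr(model, "module") else model.cpu()
with torch.no_grad():
    prediction, map_mask = model_cpu(
        textual_input.cpu(), textual_mask.cpu(), visual_input.cpu()
    )
```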