Closed: JaredChung closed this issue 1 year ago.
Hello, I'm trying to train the Reranking Model from scratch on the SOP data, but I'm getting the error below.
ERROR - Rerank (train) - Failed after 0:01:23!
Traceback (most recent calls WITHOUT Sacred internals):
  File "experiment_rerank.py", line 100, in main
    metrics = eval_function()[0]
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/utils/training.py", line 199, in evaluate_rerank
    recalls_rerank, nn_dists, nn_inds = recall_function()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/utils/metrics.py", line 188, in recall_at_ks_rerank
    tgt_global=None, tgt_local=current_index.to(device))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/models/base_model.py", line 67, in forward
    logits = self.matcher(src_global=src_global, src_local=src_local, tgt_global=tgt_global, tgt_local=tgt_local)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/models/matcher.py", line 44, in forward
    tgt_local = tgt_local.flatten(2) + self.seg_encoder(3 * src_local.new_ones((bsize, 1), dtype=torch.long)).permute(0, 2, 1) + pos_embed.flatten(2)
RuntimeError: The size of tensor a (2551) must match the size of tensor b (3000) at non-singleton dimension 0
Hi,
May I know the experiment you were running? as well as the training script?
I'm facing the same error.
The evaluation part runs fine; however, the rerank training fails:
Traceback (most recent call last):
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "experiment_rerank.py", line 106, in main
metrics = eval_function()[0]
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/utils/training.py", line 201, in evaluate_rerank
recalls_rerank, nn_dists, nn_inds = recall_function()
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/utils/metrics.py", line 201, in recall_at_ks_rerank
current_scores, _, _ = matcher(None, True,
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/models/base_model.py", line 67, in forward
logits = self.matcher(src_global=src_global, src_local=src_local, tgt_global=tgt_global, tgt_local=tgt_local)
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/models/matcher.py", line 48, in forward
tgt_local = tgt_local.flatten(2) + self.seg_encoder(3 * src_local.new_ones((bsize, 1), dtype=torch.long)).permute(0, 2, 1) + pos_embed.flatten(2)
RuntimeError: The size of tensor a (2551) must match the size of tensor b (3000) at non-singleton dimension 0
This happens when I follow the commands you suggested below:
python eval_global.py -F logs/nn_file_for_training with temp_dir=logs/nn_file_for_training \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt dataset.sop_global model.resnet50 \
query_set='train'
cp logs/nn_file_for_training/nn_inds.pkl rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl
python experiment_rerank.py -F logs/train_rerank_frozen_r50 with temp_dir=logs/train_rerank_frozen_r50 \
dataset.sop_rerank model.resnet50 model.freeze_backbone=True \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt cache_nn_inds=rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl
Since the default in the experiment_rerank.py config is cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl', and I guessed we need rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl here, I appended "cache_nn_inds=rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl" to the command.
I also found that the dataset size when running
python eval_global.py -F logs/nn_file_for_training with temp_dir=logs/nn_file_for_training \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt dataset.sop_global model.resnet50 \
query_set='train'
is 59551, but the dataset size when training rerank is 60502, which I think is the reason. (59551 = 19 × 3000 + 2551, so the 2551 in the error presumably comes from the last 3000-wide chunk of the 59551-entry train cache.)
For now, I've modified experiment_rerank.py by adding query_set to its config and, following the pattern in eval_global, using a different loader depending on it.
Is it correct to do this? Thx
Hi,
No, it is incorrect.
If you're running experiments on SOP, could you try running the commands in this instruction in order? In particular, please run the global evaluations for both the training and test sets to obtain the KNN files before training the re-ranking model. I just tried them again and they work well.
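For the test split, the cache can be generated analogously to the train command you ran; the lines below are only a sketch by analogy (the log directory name and the query_set value are assumptions, please follow the exact flags in the linked instruction):
# sketch only: directory name and query_set value are assumptions, see the linked instruction
python eval_global.py -F logs/nn_file_for_test with temp_dir=logs/nn_file_for_test \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt dataset.sop_global model.resnet50 \
query_set='test'
cp logs/nn_file_for_test/nn_inds.pkl rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl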
In order to train the reranking model, we need the KNN cache files for both the training set and the test set; that is why running the global evaluations and generating the KNN cache files is necessary. With these cache files ready, you can train the reranking model with a frozen backbone by running:
python experiment_rerank.py -F logs/train_rerank_frozen_r50 with temp_dir=logs/train_rerank_frozen_r50 \
dataset.sop_rerank model.resnet50 model.freeze_backbone=True \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt \
cache_nn_inds='rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl' \
dataset.train_cache_nn_inds='rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl' \
dataset.test_cache_nn_inds='rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl'
The last three lines here show the default options set in experiment_rerank.py and dataset_ingredient.py. You may want to change them if you use cache files with different names.
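For reference, these defaults are ordinary Sacred config entries. A simplified sketch (not the actual repo code) of how such defaults are declared and how the command line overrides them:

# Illustrative sketch of the Sacred config layout, not the actual repo code.
from sacred import Experiment, Ingredient

data_ingredient = Ingredient('dataset')

@data_ingredient.config
def data_config():
    # defaults consumed by the dataset ingredient (sketch)
    train_cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl'
    test_cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl'

ex = Experiment('Rerank (train)', ingredients=[data_ingredient])

@ex.config
def config():
    # default consumed by experiment_rerank.py (sketch)
    cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl'

# Any entry can be overridden on the command line, e.g.:
#   python experiment_rerank.py with dataset.train_cache_nn_inds=path/to/another_train_cache.pkl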
Thx a lot, it turns out the reason was that I hadn't run the evaluation to generate a test KNN file. Now everything works fine.
I hadn't used Sacred until I saw your code. Could you suggest the best way to access the model ingredient's configuration inside the data ingredient? I'm trying to set things up differently depending on the model type (ResNet or the other backbones I'm trying).
I've tried adding one more argument to the data ingredient to specify the model type, and also tried simply adding differently named configs to distinguish the model types. I believe there is a better way to do this, but I couldn't find the solution in the Sacred documentation.
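In other words, I'm hoping for something roughly like the sketch below (placeholder code with made-up names, not your actual ingredients; I'm not sure whether Sacred supports it exactly this way):

# Placeholder sketch: declare the model ingredient as a sub-ingredient of the
# data ingredient so that captured functions of the data ingredient can read
# the model config. Names and values here are illustrative only.
from sacred import Ingredient

model_ingredient = Ingredient('model')

@model_ingredient.config
def model_config():
    arch = 'resnet50'  # placeholder option for the backbone type

data_ingredient = Ingredient('dataset', ingredients=[model_ingredient])

@data_ingredient.config
def data_config():
    batch_size = 32  # placeholder

@data_ingredient.capture
def get_loaders(batch_size, model):
    # 'model' should resolve to the model ingredient's config dict
    if model['arch'] == 'resnet50':
        ...  # ResNet-specific dataset setup
    else:
        ...  # setup for other backbones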
Thx again.
@macTracyHuang I have only used Sacred in this project, so I'm not very good at it either. You may want to check the official tutorial: https://sacred.readthedocs.io/en/stable/index.html