Closed: JaredChung closed this issue 1 year ago.
Hello, I'm trying to train the Reranking Model from scratch on the SOP data, but I'm getting the error below.
ERROR - Rerank (train) - Failed after 0:01:23!
Traceback (most recent calls WITHOUT Sacred internals):
  File "experiment_rerank.py", line 100, in main
    metrics = eval_function()[0]
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/utils/training.py", line 199, in evaluate_rerank
    recalls_rerank, nn_dists, nn_inds = recall_function()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/utils/metrics.py", line 188, in recall_at_ks_rerank
    tgt_global=None, tgt_local=current_index.to(device))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/models/base_model.py", line 67, in forward
    logits = self.matcher(src_global=src_global, src_local=src_local, tgt_global=tgt_global, tgt_local=tgt_local)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jupyter/image-search/experimentation/modelling/RerankingTransformer/RRT_SOP/models/matcher.py", line 44, in forward
    tgt_local = tgt_local.flatten(2) + self.seg_encoder(3 * src_local.new_ones((bsize, 1), dtype=torch.long)).permute(0, 2, 1) + pos_embed.flatten(2)
RuntimeError: The size of tensor a (2551) must match the size of tensor b (3000) at non-singleton dimension 0
Hi,
May I know the experiment you were running? as well as the training script?
I'm facing the same error.
The evaluation part runs fine; however, the rerank training fails:
Traceback (most recent call last):
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "experiment_rerank.py", line 106, in main
metrics = eval_function()[0]
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/utils/training.py", line 201, in evaluate_rerank
recalls_rerank, nn_dists, nn_inds = recall_function()
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/utils/metrics.py", line 201, in recall_at_ks_rerank
current_scores, _, _ = matcher(None, True,
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/models/base_model.py", line 67, in forward
logits = self.matcher(src_global=src_global, src_local=src_local, tgt_global=tgt_global, tgt_local=tgt_local)
File "/home/tracy/miniconda3/envs/lcd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tracy/Documents/LCD/RerankingTransformer/RRT_SOP/models/matcher.py", line 48, in forward
tgt_local = tgt_local.flatten(2) + self.seg_encoder(3 * src_local.new_ones((bsize, 1), dtype=torch.long)).permute(0, 2, 1) + pos_embed.flatten(2)
RuntimeError: The size of tensor a (2551) must match the size of tensor b (3000) at non-singleton dimension 0
This happens when I follow the commands you suggested below:
python eval_global.py -F logs/nn_file_for_training with temp_dir=logs/nn_file_for_training \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt dataset.sop_global model.resnet50 \
query_set='train'
cp logs/nn_file_for_training/nn_inds.pkl rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl
python experiment_rerank.py -F logs/train_rerank_frozen_r50 with temp_dir=logs/train_rerank_frozen_r50 \
dataset.sop_rerank model.resnet50 model.freeze_backbone=True \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt cache_nn_inds=rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl
Since the default in the experiment_rerank.py config is cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl', and I guessed we need rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl here, I appended "cache_nn_inds=rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl" to the command.
I also found that the dataset size when running
python eval_global.py -F logs/nn_file_for_training with temp_dir=logs/nn_file_for_training \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt dataset.sop_global model.resnet50 \
query_set='train'
is 59551, but the dataset size when training rerank is 60502, which I think is the reason. (59551 = 19 × 3000 + 2551, so the 2551 in the error presumably comes from the last 3000-wide chunk of the 59551-entry train cache.)
For now, I've modified experiment_rerank.py by adding query_set to its config and, following the pattern in eval_global, using a different loader depending on it.
Is it correct to do this? Thx
Hi,
No, it is incorrect.
If you're running experiments on SOP, could you try running the commands in this instruction in order? In particular, please run the global evaluations for both the training and test sets to obtain the KNN files before training the re-ranking model. I just tried them again and they work well.
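For the test split, the cache can be generated analogously to the train command you ran; the lines below are only a sketch by analogy (the log directory name and the query_set value are assumptions, please follow the exact flags in the linked instruction):
# sketch only: directory name and query_set value are assumptions, see the linked instruction
python eval_global.py -F logs/nn_file_for_test with temp_dir=logs/nn_file_for_test \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt dataset.sop_global model.resnet50 \
query_set='test'
cp logs/nn_file_for_test/nn_inds.pkl rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl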
In order to train the reranking model, we need the KNN cache files for both the training set and the test set; that is why running the global evaluations and generating the KNN cache files is necessary. With these cache files ready, you can train the reranking model with a frozen backbone by running:
python experiment_rerank.py -F logs/train_rerank_frozen_r50 with temp_dir=logs/train_rerank_frozen_r50 \
dataset.sop_rerank model.resnet50 model.freeze_backbone=True \
resume=rrt_sop_ckpts/rrt_r50_sop_global.pt \
cache_nn_inds='rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl' \
dataset.train_cache_nn_inds='rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl' \
dataset.test_cache_nn_inds='rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl'
The last three lines here show the default options set in experiment_rerank.py and dataset_ingredient.py. You may want to change them if you use cache files with different names.
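For reference, these defaults are ordinary Sacred config entries. A simplified sketch (not the actual repo code) of how such defaults are declared and how the command line overrides them:

# Illustrative sketch of the Sacred config layout, not the actual repo code.
from sacred import Experiment, Ingredient

data_ingredient = Ingredient('dataset')

@data_ingredient.config
def data_config():
    # defaults consumed by the dataset ingredient (sketch)
    train_cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_train.pkl'
    test_cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl'

ex = Experiment('Rerank (train)', ingredients=[data_ingredient])

@ex.config
def config():
    # default consumed by experiment_rerank.py (sketch)
    cache_nn_inds = 'rrt_sop_caches/rrt_r50_sop_nn_inds_test.pkl'

# Any entry can be overridden on the command line, e.g.:
#   python experiment_rerank.py with dataset.train_cache_nn_inds=path/to/another_train_cache.pkl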
Thx a lot, it turns out the reason was that I hadn't run the evaluation to generate a test KNN file. Now everything works fine.
I hadn't used Sacred until I saw your code. Could you suggest the best way to access the model ingredient's configuration inside the data ingredient? I'm trying to set things up differently depending on the model type (ResNet or the other backbones I'm trying).
I've tried adding one more argument to the data ingredient to specify the model type, and also tried simply adding differently named configs to distinguish the model types. I believe there is a better way to do this, but I couldn't find the solution in the Sacred documentation.
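In other words, I'm hoping for something roughly like the sketch below (placeholder code with made-up names, not your actual ingredients; I'm not sure whether Sacred supports it exactly this way):

# Placeholder sketch: declare the model ingredient as a sub-ingredient of the
# data ingredient so that captured functions of the data ingredient can read
# the model config. Names and values here are illustrative only.
from sacred import Ingredient

model_ingredient = Ingredient('model')

@model_ingredient.config
def model_config():
    arch = 'resnet50'  # placeholder option for the backbone type

data_ingredient = Ingredient('dataset', ingredients=[model_ingredient])

@data_ingredient.config
def data_config():
    batch_size = 32  # placeholder

@data_ingredient.capture
def get_loaders(batch_size, model):
    # 'model' should resolve to the model ingredient's config dict
    if model['arch'] == 'resnet50':
        ...  # ResNet-specific dataset setup
    else:
        ...  # setup for other backbones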
Thx again.
@macTracyHuang I have only used Sacred in this project, so I'm not very good at it either. You may want to check the official tutorial: https://sacred.readthedocs.io/en/stable/index.html