kevinmartinjos commented 4 years ago

Hi, I am trying to reproduce the results on msmarco-passage, and I could not get train.py to run. Perhaps the readme is incomplete? What I've done so far:

I ran ./generate_file_split.sh on both training.tsv and top1000dev.tsv. I gave a separate output directory for both files
I did NOT copy over the number of batches that the above ^ script returns to the config file. I am following tr-msmarco-passage.yml as a template and I did not see an entry that asked for no. of batches
train.py seems to expect a single config file, but the configs that it accesses are spread over both configs/models/model-config.yaml and configs/datasets/tr-msmarco-passage.yaml. So I concatenated the two files and used that as the config.

Training proceeds until validation. During validation, it eventually throws the following error (relevant lines from log.txt):


2020-03-09 09:41:47,129 INFO we_params: word_embeddings.token_embedder_tokens.weight
2020-03-09 09:41:47,129 INFO params_group1: neural_ir_model.dense.weight
2020-03-09 09:41:47,217 INFO Starting 16 data loader processes, for:train-batches-0
2020-03-09 09:41:48,040 INFO [Epoch 0] --- Start training with queue.size:0
2020-03-09 09:41:53,112 INFO Starting 15 data loader processes, for:eval-batches
2020-03-09 09:41:54,176 INFO [eval_model] --- Start validation with queue.size:0
2020-03-09 10:28:34,882 INFO -----------------------------------------------------------------------------------------
2020-03-09 10:28:34,883 ERROR [train] Got exception:
Traceback (most recent call last):
File "matchmaker/train.py", line 463, in <module>
  best_metric_info,validation_cont_candidate_set,use_cache=config["validation_cont_use_cache"],output_secondary_output=False)
File "/home/kjose/transformer-kernel-ranking/matchmaker/eval.py", line 203, in validate_model
  metrics = calculate_metrics_along_candidate_depth(ranked_results,load_qrels(validation_config["qrels"]),candidate_set,validation_config["candidate_set_from_to"])
File "/home/kjose/transformer-kernel-ranking/matchmaker/core_metrics.py", line 58, in calculate_metrics_along_candidate_depth
  candidate_positions = np.array([candidates[d_id] for d_id in ranked_doc_ids])
File "/home/kjose/transformer-kernel-ranking/matchmaker/core_metrics.py", line 58, in <listcomp>
  candidate_positions = np.array([candidates[d_id] for d_id in ranked_doc_ids])
KeyError: '5246824'
2020-03-09 10:28:34,885 INFO Exiting from training early


5. Here is my config file:

expirement_base_path: "/GW/NeuralIR/nobackup/msmarco/experiments/" tqdm_disabled: False

Output directory of ./generate_file_split.sh for training.tsv

train_tsv: "/GW/NeuralIR/nobackup/msmarco/tk_output_dir/*"

validation_cont:

Output directory of ./generate_file_split.sh for top1000dev.tsv

tsv: "/GW/NeuralIR/nobackup/msmarco/tk_val_output_dir/*"

The dev qrel file supplied with msmarco

qrels: "/GW/NeuralIR/nobackup/msmarco/qrels.dev.tsv" candidate_set_from_to: [5,100]

How is this candidate set generated? I used https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md

candidate_set_path: "/GW/NeuralIR/nobackup/msmarco/run.dev.big.converted.tsv" save_only_best: True

Doesn't need it for the time being

test:

top1000:

tsv: "/data01/hofstaetter/data/msmarco-passage/test/dev.not-subset.bm25_plain_top1000-split6/*"

qrels: "/data01/hofstaetter/data/msmarco-passage/qrels/qrels.dev.tsv"

candidate_set_max: 1000

candidate_set_path: "/data01/hofstaetter/data/msmarco-passage/fs_results/plain_bm25_best_dev.not-subset_top1000.txt"

save_secondary_output: False

pre_trained_embedding_dim: 300 vocab_directory: "/GW/NeuralIR/nobackup/msmarco/vocab/"

pre_trained_embedding: "/GW/NeuralIR/nobackup/msmarco/glove.42B.300d.txt"

Deliberately setting it to a low number so that I can get to validation fast

validate_every_n_batches: 10 validation_cont_use_cache: True

token_embedder_type: "embedding" # embedding,fasttext,bert_cls train_embedding: True sparse_gradient_embedding: True

use_fp16: False

random_seed: 208973249 # real-random (from random.org)

This used to be set to TK_v6. I could not find that model in the code base.

model: "TK_v1" validation_metric: "MRR@10" optimizer: "adam"

default group (all params are in here if not otherwise specified in param_group1_names)

param_group0_learning_rate: 0.0001 param_group0_weight_decay: 0

param_group1_names: ["dense","position_bias","position_bias_absolute"] param_group1_learning_rate: 0.001 param_group1_weight_decay: 0

embedding_optimizer: "sparse_adam" embedding_optimizer_learning_rate: 0.0001 embedding_optimizer_momentum: 0.8 # only when using sgd

disable with factor = 1

learning_rate_scheduler_patience: 10 # * validate_every_n_batches = batch count to check learning_rate_scheduler_factor: 0.5

epochs: 1 batch_size_train: 32 batch_size_eval: 256

gradient_accumulation_steps: -1

early_stopping_patience: 35 # * validate_every_n_batches = batch count to check

max_doc_length: 200 max_query_length: 30

min_doc_length: -1 min_query_length: -1

secondary_output: top_n: 20

tk_att_heads: 10 tk_att_layer: 2 tk_att_proj_dim: 30 tk_att_ff_dim: 100

tk_kernels_mu: [1.0, 0.9, 0.7, 0.5, 0.3, 0.1, -0.1, -0.3, -0.5, -0.7, -0.9] tk_kernels_sigma: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

tk v6

tk_use_pos_agnostic: True tk_use_position_bias: True tk_use_diff_posencoding: True tk_position_bias_bin_percent: 0.2 tk_position_bias_absolute_steps: 4

knrm_kernels: 11

conv_knrm_ngrams: 3 conv_knrm_kernels: 11 conv_knrm_conv_out_dim: 128 # F in the paper

match_pyramid_conv_output_size : [16,16,16,16,16] match_pyramid_conv_kernel_size : [[3,3],[3,3],[3,3],[3,3],[3,3]] match_pyramid_adaptive_pooling_size: [[36,90],[18,60],[9,30],[6,20],[3,10]]

mv_lstm_hidden_dim: 32 mv_top_k: 10

pacrr_unified_query_length: 30 pacrr_unified_document_length: 200 pacrr_max_conv_kernel_size: 3 pacrr_conv_output_size: 32 pacrr_kmax_pooling_size: 5

salc_conv_knrm_kernels: 11 salc_conv_knrm_conv_out_dim: 128 salc_conv_knrm_dropi: 0 salc_conv_knrm_drops: 0 salc_conv_knrm_salc_dim: 300

salc_knrm_kernels: 11 salc_knrm_dropi: 0 salc_knrm_drops: 0 salc_knrm_salc_dim: 300

mm_light_kernels: 11



I suspect that the problem is with the candidate set generation. I generated it using Anserini, as described [here](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).

Also, the validation runs for a while. How I know this is I removed the if condition check [here](https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/eval.py#L171) so that even the "cont" validation gives an output file. As expected, I have a non-empty file. I'm not sure how to proceed debugging from here, and it would be great if someone can give some pointers

sebastian-hofstaetter commented 4 years ago

Hi, Thank you for your interest in the TK model! You are right, the readme is outdated - sorry about that :/ But i think we can fix it :)

The error you get is from a mismatch of query-document pairs between the validation_cont.tsv and validation_cont.candidate_set_path (the tsv needs to be created from the candidate_set_path via https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/preprocessing/generate_validation_input_from_candidate_set.py without add_scores (that was an abandoned experiment^^), and i would recommend top-n set to 100, so that the continuous validation runs fast, for the end validation i normally use the top-n 1000. this whole setup is needed because we actually evaluate all re-ranking depths at once (but the model only runs once for everything in the tsv)
don't use the top1000dev.tsv (that was outdated) and you don't need to set batch counts anymore, that was also fixed (i'll update the readme accordingly)
train.py actually accepts multiple configs (sep. by a whitespace in the command line arguments) and merges them, so you can have one per collection and one per model and mix and match them easily and fast, but one works as well :)
candidate generation should work with anserini, any trec_eval compatible result format should do

Hope that helps, if you have any further questions, i am happy to help! Best, Sebastian