sebastian-hofstaetter / matchmaker

Training & evaluation library for text-based neural re-ranking and dense retrieval models built with PyTorch
https://neural-ir-explorer.ec.tuwien.ac.at/
Apache License 2.0

Missing info on the readme? #3

Closed · kevinmartinjos closed this issue 4 years ago

kevinmartinjos commented 4 years ago

Hi, I am trying to reproduce the results on msmarco-passage, and I could not get train.py to run. Perhaps the readme is incomplete? What I've done so far:

  1. I ran ./generate_file_split.sh on both training.tsv and top1000dev.tsv, giving each file its own output directory.
  2. I did NOT copy the number of batches that the script reports into the config file. I am following tr-msmarco-passage.yml as a template and did not see an entry that asks for the number of batches.
  3. train.py seems to expect a single config file, but the settings it accesses are spread across both configs/models/model-config.yaml and configs/datasets/tr-msmarco-passage.yaml, so I concatenated the two and used the result as the config (a short sketch of this is included after the config below).
  4. Training proceeds until validation. During validation, it eventually throws the following error (relevant lines from log.txt):
    
    2020-03-09 09:41:47,129 INFO we_params: word_embeddings.token_embedder_tokens.weight
    2020-03-09 09:41:47,129 INFO params_group1: neural_ir_model.dense.weight
    2020-03-09 09:41:47,217 INFO Starting 16 data loader processes, for:train-batches-0
    2020-03-09 09:41:48,040 INFO [Epoch 0] --- Start training with queue.size:0
    2020-03-09 09:41:53,112 INFO Starting 15 data loader processes, for:eval-batches
    2020-03-09 09:41:54,176 INFO [eval_model] --- Start validation with queue.size:0
    2020-03-09 10:28:34,882 INFO -----------------------------------------------------------------------------------------
    2020-03-09 10:28:34,883 ERROR [train] Got exception:
    Traceback (most recent call last):
    File "matchmaker/train.py", line 463, in <module>
      best_metric_info,validation_cont_candidate_set,use_cache=config["validation_cont_use_cache"],output_secondary_output=False)
    File "/home/kjose/transformer-kernel-ranking/matchmaker/eval.py", line 203, in validate_model
      metrics = calculate_metrics_along_candidate_depth(ranked_results,load_qrels(validation_config["qrels"]),candidate_set,validation_config["candidate_set_from_to"])
    File "/home/kjose/transformer-kernel-ranking/matchmaker/core_metrics.py", line 58, in calculate_metrics_along_candidate_depth
      candidate_positions = np.array([candidates[d_id] for d_id in ranked_doc_ids])
    File "/home/kjose/transformer-kernel-ranking/matchmaker/core_metrics.py", line 58, in <listcomp>
      candidate_positions = np.array([candidates[d_id] for d_id in ranked_doc_ids])
    KeyError: '5246824'
    2020-03-09 10:28:34,885 INFO Exiting from training early

5. Here is my config file:

    expirement_base_path: "/GW/NeuralIR/nobackup/msmarco/experiments/"
    tqdm_disabled: False

    # Output directory of ./generate_file_split.sh for training.tsv
    train_tsv: "/GW/NeuralIR/nobackup/msmarco/tk_output_dir/*"

    validation_cont:
      # Output directory of ./generate_file_split.sh for top1000dev.tsv
      tsv: "/GW/NeuralIR/nobackup/msmarco/tk_val_output_dir/*"
      # The dev qrel file supplied with msmarco
      qrels: "/GW/NeuralIR/nobackup/msmarco/qrels.dev.tsv"
      candidate_set_from_to: [5,100]
      # How is this candidate set generated? I used https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md
      candidate_set_path: "/GW/NeuralIR/nobackup/msmarco/run.dev.big.converted.tsv"
      save_only_best: True

    # Doesn't need it for the time being
    test:
      top1000:
        tsv: "/data01/hofstaetter/data/msmarco-passage/test/dev.not-subset.bm25_plain_top1000-split6/*"
        qrels: "/data01/hofstaetter/data/msmarco-passage/qrels/qrels.dev.tsv"
        candidate_set_max: 1000
        candidate_set_path: "/data01/hofstaetter/data/msmarco-passage/fs_results/plain_bm25_best_dev.not-subset_top1000.txt"
        save_secondary_output: False

    pre_trained_embedding_dim: 300
    vocab_directory: "/GW/NeuralIR/nobackup/msmarco/vocab/"
    pre_trained_embedding: "/GW/NeuralIR/nobackup/msmarco/glove.42B.300d.txt"

    # Deliberately setting it to a low number so that I can get to validation fast
    validate_every_n_batches: 10
    validation_cont_use_cache: True

    token_embedder_type: "embedding" # embedding,fasttext,bert_cls
    train_embedding: True
    sparse_gradient_embedding: True

    use_fp16: False

    random_seed: 208973249 # real-random (from random.org)

    # This used to be set to TK_v6. I could not find that model in the code base.
    model: "TK_v1"
    validation_metric: "MRR@10"
    optimizer: "adam"

    # default group (all params are in here if not otherwise specified in param_group1_names)
    param_group0_learning_rate: 0.0001
    param_group0_weight_decay: 0

    param_group1_names: ["dense","position_bias","position_bias_absolute"]
    param_group1_learning_rate: 0.001
    param_group1_weight_decay: 0

    embedding_optimizer: "sparse_adam"
    embedding_optimizer_learning_rate: 0.0001
    embedding_optimizer_momentum: 0.8 # only when using sgd

    # disable with factor = 1
    learning_rate_scheduler_patience: 10 # * validate_every_n_batches = batch count to check
    learning_rate_scheduler_factor: 0.5

    epochs: 1
    batch_size_train: 32
    batch_size_eval: 256

    gradient_accumulation_steps: -1

    early_stopping_patience: 35 # * validate_every_n_batches = batch count to check

    max_doc_length: 200
    max_query_length: 30

    min_doc_length: -1
    min_query_length: -1

    secondary_output:
      top_n: 20

    tk_att_heads: 10
    tk_att_layer: 2
    tk_att_proj_dim: 30
    tk_att_ff_dim: 100

    tk_kernels_mu: [1.0, 0.9, 0.7, 0.5, 0.3, 0.1, -0.1, -0.3, -0.5, -0.7, -0.9]
    tk_kernels_sigma: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

    # tk v6
    tk_use_pos_agnostic: True
    tk_use_position_bias: True
    tk_use_diff_posencoding: True
    tk_position_bias_bin_percent: 0.2
    tk_position_bias_absolute_steps: 4

    knrm_kernels: 11

    conv_knrm_ngrams: 3
    conv_knrm_kernels: 11
    conv_knrm_conv_out_dim: 128 # F in the paper

    match_pyramid_conv_output_size: [16,16,16,16,16]
    match_pyramid_conv_kernel_size: [[3,3],[3,3],[3,3],[3,3],[3,3]]
    match_pyramid_adaptive_pooling_size: [[36,90],[18,60],[9,30],[6,20],[3,10]]

    mv_lstm_hidden_dim: 32
    mv_top_k: 10

    pacrr_unified_query_length: 30
    pacrr_unified_document_length: 200
    pacrr_max_conv_kernel_size: 3
    pacrr_conv_output_size: 32
    pacrr_kmax_pooling_size: 5

    salc_conv_knrm_kernels: 11
    salc_conv_knrm_conv_out_dim: 128
    salc_conv_knrm_dropi: 0
    salc_conv_knrm_drops: 0
    salc_conv_knrm_salc_dim: 300

    salc_knrm_kernels: 11
    salc_knrm_dropi: 0
    salc_knrm_drops: 0
    salc_knrm_salc_dim: 300

    mm_light_kernels: 11
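
For reference, the concatenation in step 3 amounts to something like the following (a throwaway helper of my own, not part of the repo; the output file name is made up, and it assumes the two config files define disjoint keys and that PyYAML is available):

    # merge_configs.py - my own throwaway helper (not part of matchmaker);
    # it merges the model config and the dataset config into the single YAML
    # file that train.py expects, assuming the two files define disjoint keys.
    import yaml

    config_files = [
        "configs/models/model-config.yaml",
        "configs/datasets/tr-msmarco-passage.yaml",
    ]

    merged = {}
    for path in config_files:
        with open(path, "r") as f:
            merged.update(yaml.safe_load(f))

    # write the combined file that is then passed to train.py
    with open("my-combined-config.yaml", "w") as f:
        yaml.safe_dump(merged, f, default_flow_style=False)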



I suspect that the problem is with the candidate set generation. I generated it using Anserini, as described [here](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).
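
To narrow this down, my next step is to check whether every doc id that shows up in the re-ranked validation results is actually present in the candidate set file. Roughly along these lines (a debugging sketch of my own; the validation output path is hypothetical, and the assumption that both files are whitespace-separated with the query id in the first column and the doc id in the second may well be wrong):

    # check_candidates.py - rough debugging sketch (my own; the column layout is an assumption).
    # It lists doc ids that occur in the validation output but not in the candidate set,
    # i.e. exactly the ids that would trigger the KeyError in calculate_metrics_along_candidate_depth.
    from collections import defaultdict

    candidate_set_path = "/GW/NeuralIR/nobackup/msmarco/run.dev.big.converted.tsv"
    validation_run_path = "validation-run.txt"  # hypothetical path to the file written by eval.py

    # assumed layout for both files: qid <whitespace> docid <whitespace> rank [...]
    candidates = defaultdict(set)
    with open(candidate_set_path, "r") as f:
        for line in f:
            qid, doc_id, *_ = line.split()
            candidates[qid].add(doc_id)

    missing = defaultdict(set)
    with open(validation_run_path, "r") as f:
        for line in f:
            qid, doc_id, *_ = line.split()
            if doc_id not in candidates[qid]:
                missing[qid].add(doc_id)

    print("queries with at least one missing candidate:", len(missing))
    for qid in list(missing)[:10]:
        print(qid, sorted(missing[qid])[:5])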

Also, the validation does run for a while: I know this because I removed the if-condition check [here](https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/eval.py#L171) so that even the "cont" validation writes an output file, and as expected that file is non-empty. I'm not sure how to proceed with debugging from here, so it would be great if someone could give me some pointers.
sebastian-hofstaetter commented 4 years ago

Hi, thank you for your interest in the TK model! You are right, the readme is outdated - sorry about that :/ But I think we can fix it :)

Hope that helps! If you have any further questions, I am happy to help. Best, Sebastian

kevinmartinjos commented 4 years ago

Hi Sebastian,

Thanks for the quick reply! I'll try this and get back to you :)