thunlp / OpenMatch

An Open-Source Package for Information Retrieval.

Problem reproducing RoBERTa-Large and ELECTRA-Large results #46

Open · yiyaxiaozhi opened this issue 2 years ago

yiyaxiaozhi commented 2 years ago

My environment is PyTorch 1.4.0 and transformers 2.8.0. I followed the training command from the docs at https://github.com/thunlp/OpenMatch/blob/master/docs/experiments-msmarco.md:

CUDA_VISIBLE_DEVICES=0 \
python train.py \
        -task ranking \
        -model bert \
        -train ./data/train.jsonl \
        -max_input 3000000 \
        -save ./checkpoints/electra_large.bin \
        -dev queries=./data/queries.dev.small.tsv,docs=./data/collection.tsv,qrels=./data/qrels.dev.small.tsv,trec=./data/run.msmarco-passage.dev.small.100.trec \
        -qrels ./data/qrels.dev.small.tsv \
        -vocab google/electra-large-discriminator \
        -pretrain google/electra-large-discriminator \
        -res ./results/electra_large.trec \
        -metric mrr_cut_10 \
        -max_query_len 32 \
        -max_doc_len 256 \
        -epoch 1 \
        -batch_size 2 \
        -lr 5e-6 \
        -eval_every 10000

At around global step 180k (local step ~720k), the validation MRR starts dropping steadily from 0.33, and by the end of training the best MRR only reaches 0.336. What might I be missing that causes this?

Yu-Shi commented 2 years ago

Could you try increasing the batch size? You can use multi-GPU training or gradient accumulation.
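
For reference, gradient accumulation sums the gradients of several small batches before each optimizer step, so a per-GPU batch size of 2 with 8 accumulation steps behaves like an effective batch size of 16. Below is a minimal, generic PyTorch sketch of the idea; it is not OpenMatch's actual training loop, and you should check whether train.py already exposes an accumulation argument before hand-rolling it:

# Generic gradient-accumulation sketch (illustrative only, not OpenMatch code).
# Gradients from `accum_steps` mini-batches are accumulated before one
# optimizer.step(), emulating a larger batch without extra GPU memory.
import torch

def train_epoch(model, loader, optimizer, loss_fn, accum_steps=8, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        inputs, labels = inputs.to(device), labels.to(device)
        loss = loss_fn(model(inputs), labels)
        # Scale the loss so the accumulated gradient matches one large batch.
        (loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

With batch_size 2 and accum_steps 8 this gives the same effective batch size as batch_size 16, at the cost of proportionally fewer optimizer updates per epoch.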