microsoft / ANCE

A novel embedding training algorithm that leverages ANN search and achieves SOTA retrieval on the TREC DL 2019 and OpenQA benchmarks
MIT License

Using the test set to compute NDCG while training #9

Open yizhilll opened 3 years ago

yizhilll commented 3 years ago

Hi,

While working on the TREC DL document task, I found that the code in msmarco_data.py processes "msmarco-test2019-queries.tsv" as the dev-query file, so the TREC DL 2019 test queries end up being used as the dev set during training.

https://github.com/microsoft/ANCE/blob/936ec3e18b8a3fd30df91c13be650a3f8ca55f82/data/msmarco_data.py#L190

    if args.data_type == 0:
        write_query_rel(
            args,
            pid2offset,
            "msmarco-doctrain-queries.tsv",
            "msmarco-doctrain-qrels.tsv",
            "train-query",
            "train-qrel.tsv")
        write_query_rel(
            args,
            pid2offset,
            "msmarco-test2019-queries.tsv",
            "2019qrels-docs.txt",
            "dev-query",
            "dev-qrel.tsv")

If I want to reproduce your work, is it okay to use "msmarco-docdev-queries.tsv" as the dev set for selecting the best checkpoint, instead of the test queries?
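
For concreteness, the swap being asked about could be sketched roughly as below. This is only an illustration, not code from the ANCE repo: `SplitFiles` and `dev_split_files` are hypothetical names, and the qrels file name "msmarco-docdev-qrels.tsv" is an assumption about which file pairs with the dev queries. Whether `write_query_rel` parses that qrels file the same way as "2019qrels-docs.txt" has not been checked here.

```python
from typing import NamedTuple


class SplitFiles(NamedTuple):
    """Hypothetical container for the two input files of one query split."""
    queries: str
    qrels: str


# File names exactly as they appear in the snippet above (TREC DL 2019 test set).
TREC_DL_2019_TEST = SplitFiles("msmarco-test2019-queries.tsv", "2019qrels-docs.txt")

# Proposed replacement. The queries file is the one named in the question;
# the qrels file name is an ASSUMPTION, not taken from the repo.
MSMARCO_DOC_DEV = SplitFiles("msmarco-docdev-queries.tsv", "msmarco-docdev-qrels.tsv")


def dev_split_files(use_held_out_dev: bool) -> SplitFiles:
    """Return the files that would be preprocessed as "dev-query" / "dev-qrel.tsv"."""
    return MSMARCO_DOC_DEV if use_held_out_dev else TREC_DL_2019_TEST


if __name__ == "__main__":
    files = dev_split_files(use_held_out_dev=True)
    # In msmarco_data.py, these would replace the arguments of the second call:
    #   write_query_rel(args, pid2offset, files.queries, files.qrels,
    #                   "dev-query", "dev-qrel.tsv")
    print(files.queries, files.qrels)
```

With this arrangement the TREC DL 2019 test queries would only be touched for the final evaluation, and checkpoint selection would rely on the held-out dev queries.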