I used:

```
python -m wikiextractor.WikiExtractor --no-templates --json --processes 200 enwiki-latest-pages-articles.xml
```

and got 16 million passages, about 72% of your reported 22 million passages.
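For reference, I count the extracted documents with something along these lines (a minimal sketch, assuming wikiextractor's default `--json` output layout of one JSON object per line in files named `wiki_*` under subdirectories like `AA`, `AB`, ...):

```python
# Count the documents extracted by wikiextractor --json.
# Assumes the default layout: <output_dir>/AA/wiki_00, <output_dir>/AB/wiki_01, ...
# with one JSON object (one article) per line.
import json
from pathlib import Path

def count_documents(output_dir: str) -> int:
    total = 0
    for path in sorted(Path(output_dir).glob("*/wiki_*")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                if doc.get("text", "").strip():  # skip articles with empty bodies
                    total += 1
    return total

if __name__ == "__main__":
    print(count_documents("text"))  # "text" is wikiextractor's default output dir
```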
So would you please share your wikiextractor command? Thank you. @jzhoubu
Hi @1024er. Unfortunately, I didn't save any of the wikiextractor commands. As far as I can recall, I didn't use any template-related options.
Thank you for your reply. @jzhoubu
I also tried:

```
python -m wikiextractor.WikiExtractor --json --processes 200 ../enwiki-latest-pages-articles.xml
```

and still got 16 million passages.
Will you please provide the full name of the "snapshot 03-01-2021 of an English Wikipedia dump"? I am not sure whether it means March 01 or January 03.
Thank you
> Will you please provide the full name of the "snapshot 03-01-2021 of an English Wikipedia dump"? I am not sure whether it means March 01 or January 03.
It's March 01 (i.e., the `enwiki-20210301` dump). Meanwhile, I think either the Jan 2021 or the March 2021 dump can yield more than 21M passages; DPR uses an even older Wikipedia dump and has 21M passages. The final count depends heavily on the preprocessing pipeline.
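For context, DPR's 21M figure comes from cutting cleaned article text into non-overlapping 100-word passages. A minimal sketch of just that splitting step (the surrounding cleaning and page filtering, which also strongly affect the count, are assumptions that vary by pipeline):

```python
# A minimal sketch of DPR-style passage construction: cut an article's
# cleaned text into non-overlapping 100-word chunks. Cleaning, dropping
# lists/disambiguation pages, and attaching titles are omitted here.
def split_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

passages = split_into_passages("word " * 250)  # toy article of 250 words
print(len(passages))  # 3 passages: 100 + 100 + 50 words
```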
@1024er I'm not sure whether my email reached you, so I will also reply to your fine-tuning question here.
Please use this command to fine-tune instead:

```
python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    hydra.run.dir=./experiments/finetune_nq/train \
    model_file=../../pretrain_hlp/train/dpr_biencoder.best \
    train_datasets=[nq_train] dev_datasets=[nq_dev] \
    train=finetune_8xV100
```
For fine-tuning, use `train=finetune_8xV100` in all the commands (the training, embedding, and evaluation phases).
Since your results are worse than the zero-shot setting, I suspect there may be other problems as well. Could you provide the information below so that I can help better:
1. Your `pretrain_8xV100.yaml` file.
2. Is `../../pretrain_hlp/train/dpr_biencoder.best` the same checkpoint as our provided `hlp20210726.best`?

Thanks.
Finally, with the help of @jzhoubu, I reproduced the results on the NQ dataset. Thank you so much!
| NQ | Zero-shot (top5 / top20 / top100) | Fine-tune (top5 / top20 / top100) |
|---|---|---|
| HLP (origin) | 51.2 / 70.2 / 82.0 | 70.9 / 81.4 / 88.0 |
| HLP (latest) | 50.9 / 69.3 / 82.1 | 70.6 / 81.3 / 88.0 |
| Reproduced (on 8xA100) | 51.6 / 69.1 / 82.2 | 70.2 / 81.6 / 87.8 |
| Reproduced (80% data on 8xV100) | 50.6 / 69.4 / 82.2 | 69.1 / 81.1 / 87.4 |
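For anyone checking these numbers: top-k retrieval accuracy counts a question as answered if any of the top k retrieved passages contains an answer string. A minimal sketch of that metric (simple boolean relevance flags are assumed as input; the official DPR evaluation uses a more careful token-level answer match):

```python
# Compute top-k retrieval accuracy from ranked retrieval results.
# results: for each question, one boolean flag per retrieved passage,
# ordered by rank (True if the passage contains an answer).
def topk_accuracy(results: list[list[bool]], ks=(5, 20, 100)) -> dict[int, float]:
    scores = {}
    for k in ks:
        hits = sum(any(flags[:k]) for flags in results)
        scores[k] = hits / len(results)
    return scores

# Example: two questions, with the answer found at ranks 3 and 30 respectively.
flags = [
    [False, False, True] + [False] * 97,
    [False] * 29 + [True] + [False] * 70,
]
print(topk_accuracy(flags))  # {5: 0.5, 20: 0.5, 100: 1.0}
```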
Hello, I see that WikiExtractor has some options. Did you use any of these options?
```
usage: wikiextractor [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] [-l] [-ns ns1,ns2]
                     [--templates TEMPLATES] [--no-templates] [--html-safe HTML_SAFE]
                     [--processes PROCESSES] [-q] [--debug] [-a] [-v]
                     input
```
Thank you