Hi @zhengmq2010
Yes, you can prepare your data in jsonl format, with each line being a dictionary as below:
{
"query_id": "<query id>",
"query": "<query text>",
"positive_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
],
"negative_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
]
}
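For example, a minimal Python sketch that writes examples in this schema (all values below are placeholders):

import json

# Replace these placeholder examples with your own queries and passages.
examples = [
    {
        "query_id": "0",
        "query": "example query text",
        "positive_passages": [
            {"docid": "p1", "title": "a title", "text": "a relevant passage body"}
        ],
        "negative_passages": [
            {"docid": "p2", "title": "a title", "text": "an irrelevant passage body"}
        ],
    }
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        # One JSON object per line, no pretty-printing.
        f.write(json.dumps(example, ensure_ascii=False) + "\n")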
and when you run training:
CUDA_VISIBLE_DEVICES=0 python -m tevatron.driver.train \
...
--dataset_name Tevatron/msmarco-passage \ (this will use the data processing script, but not the MS MARCO data itself)
--train_dir <path to the custom jsonl file>
...
Thanks for the detailed reply! Does specifying "dataset_name" when using my own data mean the code will automatically process my data into the required format? If I want to use the NQ dataset, do I still specify "Tevatron/msmarco-passage"?
Yes, for NQ:
python -m tevatron.driver.train \
...
--dataset_name Tevatron/wikipedia-nq \
--train_dir train_dir \
...
where the data format is
{
"query_id": "<query id>",
"query": "<query text>",
"answers": ["<answer>"],
"positive_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
],
"negative_passages": [
{"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
]
}
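Before launching a run, a quick schema check can catch formatting mistakes. Here is a minimal sketch that only verifies the fields shown above are present (the file name is an assumption):

import json

# Top-level and per-passage fields from the format above.
REQUIRED_KEYS = {"query_id", "query", "answers", "positive_passages", "negative_passages"}
PASSAGE_KEYS = {"docid", "title", "text"}

with open("train.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)
        missing = REQUIRED_KEYS - example.keys()
        assert not missing, f"line {line_no}: missing keys {missing}"
        for passage in example["positive_passages"] + example["negative_passages"]:
            assert PASSAGE_KEYS <= passage.keys(), f"line {line_no}: passage missing fields"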
If I want to use the TriviaQA dataset or other datasets that have the same format as NQ, should I still specify "dataset_name" as "Tevatron/wikipedia-nq"? What files should be included in "train_dir", just the train and dev files? Is there a file naming requirement?
Yeah, you can just use "Tevatron/wikipedia-nq" if the format is the same as NQ, and put the jsonl file in train_dir.
Only train data; we don't do eval during training. There is no naming requirement, but the files should end with .jsonl.
I have one more question. Can the saved model from "example_dpr.md" be used directly in the official DPR code?
IIRC, the official DPR code has the BERT layer names modified, while our checkpoint can be loaded directly by BertModel. So if you want to do inference using the DPR repo code, you need to modify the layer names, but I think further fine-tuning should be fine.
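If it helps, a rough sketch of that kind of key remapping. The prefix here is purely a placeholder; inspect the key names in an official DPR checkpoint before relying on it:

import torch

# Load the Tevatron-trained encoder weights (path is a placeholder).
state_dict = torch.load("model_nq/pytorch_model.bin", map_location="cpu")

# Hypothetical prefix: substitute whatever prefix the DPR repo's
# BiEncoder state dict actually uses for this encoder.
DPR_PREFIX = "question_model."
remapped = {DPR_PREFIX + name: tensor for name, tensor in state_dict.items()}

torch.save(remapped, "dpr_style_checkpoint.bin")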
Thanks for your patient reply! I will give it a try.
When the dataset is too large, I always get the following error. How do I solve it?
Hi, how large is your data? "No space left on device" seems to be because the Hugging Face cache directory is running out of space.
The dataset is 50+ GB with about 1.7 million examples; the error seems to occur when generating the dataset train split.
How much space is left in your /root/.cache? Hugging Face datasets will first convert the json to pyarrow files stored there.
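Concretely, redirecting that cache when loading a json dataset looks roughly like this (the path is just an example; setting the HF_DATASETS_CACHE environment variable works as well):

from datasets import load_dataset

# Point the Arrow conversion cache at a disk with enough free space.
dataset = load_dataset(
    "json",
    data_files="train.jsonl",
    cache_dir="/data/hf_cache",
)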
I changed the cache_dir to a bigger directory when loading the dataset, and then it worked. Thanks for your advice!
Can I train with my own data? How do I modify the arguments?