texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0
494 stars 94 forks source link

Can I train with my own data? #84

Closed zhengmq2010 closed 1 year ago

zhengmq2010 commented 1 year ago

Can I train with my own data? How do I modify the arguments?

MXueguang commented 1 year ago

Hi @zhengmq2010

yes, you can process your data into jsonl format, with each line a dictionary as below:

{
   "query_id": "<query id>",
   "query": "<query text>",
   "positive_passages": [
     {"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
   ],
   "negative_passages": [
     {"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
   ]
}
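For reference, a minimal sketch (standard library only; the example record and the file name `train.jsonl` are placeholders, not Tevatron defaults) that writes records in this one-JSON-object-per-line format:

```python
import json

# Hypothetical training examples in the format above: each record pairs a
# query with its positive (relevant) and negative (irrelevant) passages.
examples = [
    {
        "query_id": "q1",
        "query": "what is neural retrieval",
        "positive_passages": [
            {"docid": "d1", "title": "Neural IR", "text": "Neural retrieval ranks passages with ..."}
        ],
        "negative_passages": [
            {"docid": "d2", "title": "Bread", "text": "How to bake sourdough ..."}
        ],
    }
]

# Write one JSON object per line (jsonl), as Tevatron expects.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```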

and when you run training:

CUDA_VISIBLE_DEVICES=0 python -m tevatron.driver.train \
  ...
  --dataset_name Tevatron/msmarco-passage \
  --train_dir <path to the custom jsonl file> \
  ...

(setting --dataset_name to Tevatron/msmarco-passage reuses the msmarco data-processing script, not the msmarco data)

zhengmq2010 commented 1 year ago

Thanks for the detailed reply! Does specifying "dataset_name" when using my own data mean the code will automatically process my data into the required format? If I want to use the NQ dataset, do I still specify "Tevatron/msmarco-passage"?

MXueguang commented 1 year ago

yes, for NQ

python -m tevatron.driver.train \
  ...
  --dataset_name Tevatron/wikipedia-nq \
  --train_dir train_dir \
  ...

where the data format is

{
   "query_id": "<query id>",
   "query": "<query text>",
   "answers": ["<answer>"],
   "positive_passages": [
     {"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
   ],
   "negative_passages": [
     {"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"}
   ]
}
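A quick sanity check can catch format mistakes before training starts. The sketch below (the field list comes from the format above; the helper name and sample line are illustrative) validates that each jsonl line carries the NQ-style fields:

```python
import json

# Fields required by the NQ-style training format shown above.
REQUIRED_FIELDS = {"query_id", "query", "answers", "positive_passages", "negative_passages"}

def check_line(line):
    """Parse one jsonl line and verify it has all NQ-style fields."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

# Example: a well-formed line passes and is returned as a dict.
rec = check_line(
    '{"query_id": "q1", "query": "who wrote hamlet", "answers": ["Shakespeare"],'
    ' "positive_passages": [], "negative_passages": []}'
)
```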
zhengmq2010 commented 1 year ago

If I want to use the TriviaQA dataset or other datasets that have the same format as NQ, should I still specify "dataset_name" as "Tevatron/wikipedia-nq"? What files should be included in "train_dir", just the train and dev files? Is there a file-naming requirement?

MXueguang commented 1 year ago

yeah, you can just use "Tevatron/wikipedia-nq" if the format is the same as NQ, and put the jsonl file in train_dir. Only train data; we don't do eval during training. There is no naming requirement, but files should end with .jsonl.

zhengmq2010 commented 1 year ago

I have one more question. Can the saved model from "example_dpr.md" be directly used in DPR official code?

MXueguang commented 1 year ago

iirc, the DPR official code has the BERT layer names modified; our checkpoint can be loaded directly by BertModel. So if you want to do inference using the DPR repo code, you need to modify the layer names, but I think further finetuning should be fine.
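The renaming mentioned above amounts to rewriting the keys of the checkpoint's state dict. A minimal sketch, with the caveat that the `question_model.` prefix is an illustrative assumption; inspect an actual DPR checkpoint to confirm the exact key naming it expects:

```python
# Hedged sketch: the official DPR code stores BERT weights under prefixed key
# names, while a Hugging Face BertModel state dict uses bare names like
# "embeddings.word_embeddings.weight". The prefix below is illustrative only.
def add_dpr_prefix(state_dict, prefix="question_model."):
    """Return a copy of a BertModel state dict with DPR-style key prefixes."""
    return {prefix + key: value for key, value in state_dict.items()}

# Example with a stand-in state dict (real ones map names to torch tensors).
renamed = add_dpr_prefix({"embeddings.word_embeddings.weight": 0})
```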

zhengmq2010 commented 1 year ago

Thanks for your patient reply! I will give it a try.

zhaobinNF commented 8 months ago

When the dataset is too large, I always get the following error. How can I solve it?

[screenshot of the error message]
MXueguang commented 8 months ago

Hi, how large is your data? "No space left on device" usually means the huggingface cache directory is running out of space.

zhaobinNF commented 8 months ago

Hi, how large is your data? "No space left on device" usually means the huggingface cache directory is running out of space.

The dataset is 50+ GB with about 1.7 million examples; the error seems to occur when generating the dataset train split.

MXueguang commented 8 months ago

How much space is left in your /root/.cache? Huggingface datasets will first convert the json to pyarrow files stored there.

zhaobinNF commented 8 months ago

How much space is left in your /root/.cache? Huggingface datasets will first convert the json to pyarrow files stored there.

I changed the cache_dir to a bigger directory when loading the dataset, and it works now. Thanks for your advice!
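For anyone hitting the same error, a minimal sketch of the fix described above (the path `/data/hf_cache` is a placeholder; use any directory with enough free space):

```python
import os

# Option 1: redirect the Hugging Face datasets cache before `datasets` is
# imported; the json -> pyarrow conversion files are written under this path.
os.environ["HF_DATASETS_CACHE"] = "/data/hf_cache"  # placeholder path

# Option 2 (sketch, not executed here): pass cache_dir directly when loading,
# which is what the comment above describes:
# from datasets import load_dataset
# dataset = load_dataset("json", data_files="train.jsonl", cache_dir="/data/hf_cache")
```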