jyfang6 / trace


Can I use my own dataset for this, and if yes, what should I pass instead of --hotpotqa in preprocessing? #1

Open ayushjadia opened 1 month ago

jyfang6 commented 1 month ago

Hi @ayushjadia,

Thank you for your interest in our work.

You can certainly use your own dataset with our TRACE model. TRACE aims to improve the multi-hop reasoning ability of the Reader model in the RAG framework. The expected inputs are questions and documents retrieved from a corpus. The input format and data types we use are:

```
{
    "id": question id (str),
    "question": question (str),
    "answers": answers to the question (List[str]),
    "ctxs": [
        {
            "id": document id (str),
            "title": document title (str),
            "text": the text of the document (str),
            "sentences": sentences in the document (List[str])
        }
    ]
}
```
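To illustrate, here is a minimal sketch of converting your own data into this format. The helper name and the use of nltk for sentence splitting are just for illustration, not part of TRACE itself; any sentence splitter that suits your text will do, and you should check the provided datasets to see whether a single JSON list or JSON lines is expected.

```python
import json
from nltk.tokenize import sent_tokenize  # may require: nltk.download("punkt")

def build_record(question_id, question, answers, documents):
    """Convert one question and its retrieved documents into the expected format.
    `documents` is assumed to be a list of (doc_id, title, text) tuples."""
    return {
        "id": str(question_id),
        "question": question,
        "answers": list(answers),
        "ctxs": [
            {
                "id": str(doc_id),
                "title": title,
                "text": text,
                "sentences": sent_tokenize(text),  # per-sentence splits of the text
            }
            for doc_id, title, text in documents
        ],
    }

# Example with illustrative values only:
record = build_record(
    "q_0",
    "Who wrote the novel that the film Solaris (1972) is based on?",
    ["Stanislaw Lem"],
    [("d_0", "Solaris (1972 film)", "Solaris is a 1972 Soviet science fiction film ...")],
)
with open("my_dataset.json", "w") as f:
    json.dump([record], f, indent=2)
```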

Once you have the above inputs, you can go through the following steps to generate answers:

  1. KG Generation

    • If you want to use a retriever to adaptively select demonstrations when generating KG triples for each document, run the get_document_demonstration_rank function in preprocessing.py (a conceptual sketch of this ranking step follows after this list). Otherwise, you can skip this step, in which case the first $n$ demonstrations will be used for all documents.
    • Then you can use the command in the "KG Generation" section to generate KG triples for each document.
    • The current prompt and demonstrations for generating KG triples are designed for Wikipedia documents, in particular the documents from our experimental datasets, so you may need to adapt them to your specific text dataset.
  2. Reasoning Chain Generation

    • Once you have generated the KG triples, run the command in the "Reasoning Chain Construction" section to construct reasoning chains for each question.
    • We provide demonstrations for each experimental dataset. You can first try using these demonstrations on your own dataset; if the performance is not satisfactory, we recommend constructing demonstrations tailored to your data.
  3. Answer Generation

    • After generating the reasoning chains, run the command in the "Answer Generation" section to generate an answer for each question and evaluate the performance.
    • TRACE supports two ways of using the reasoning chains to generate answers: TRACE-Triple, which uses only the KG triples as context, and TRACE-Doc, which uses the KG triples to identify a subset of relevant documents and then uses those documents as context (see the second sketch below). You can switch between these two settings with the --context_type hyperparameter.
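For step 1, here is a rough illustration of what adaptive demonstration selection does conceptually. This is not the actual get_document_demonstration_rank implementation; the embedding model and similarity scheme are assumptions, so please check preprocessing.py for the real signature and logic.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: demonstrations are ranked by embedding similarity to each document.
# The real logic lives in get_document_demonstration_rank in preprocessing.py.
model = SentenceTransformer("all-MiniLM-L6-v2")

def rank_demonstrations(document_text, demonstration_texts):
    """Return demonstration indices sorted from most to least similar to the document."""
    doc_emb = model.encode([document_text], normalize_embeddings=True)
    demo_embs = model.encode(demonstration_texts, normalize_embeddings=True)
    scores = (demo_embs @ doc_emb.T).squeeze(-1)  # cosine similarity on unit vectors
    return np.argsort(-scores).tolist()
```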
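And for step 3, a toy sketch of the difference between the two context types described above. The data structures and the "triple"/"doc" values are illustrative only; check the code for the values --context_type actually accepts.

```python
def build_context(chain, documents, context_type):
    """chain: list of (head, relation, tail, source_doc_id) tuples from a reasoning chain.
    documents: dict mapping doc_id -> document text.
    context_type: "triple" or "doc" (illustrative values)."""
    if context_type == "triple":
        # TRACE-Triple: the KG triples themselves are the context.
        return "\n".join(f"({h}, {r}, {t})" for h, r, t, _ in chain)
    else:
        # TRACE-Doc: the triples only identify the relevant documents,
        # whose full text is then used as context.
        relevant_ids = {doc_id for _, _, _, doc_id in chain}
        return "\n\n".join(documents[d] for d in sorted(relevant_ids))
```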

Feel free to contact me if you have any further questions or discussions.