Hi @seominjoon, Thanks for the response. Yes, that is exactly what I am looking for. Could you point me to where exactly I should look? That would be very helpful. Thanks in advance :)
Hi @Arjunsankarlal, the code for indexing starts here: https://github.com/uwnlp/denspi/blob/11ff5f8d31390384c8346e82f764c3b3c4e5b819/run_piqa.py#L655 Thanks!
Hi, is there any update on this?
I was trying to generate the sparse index for my own corpus. I assumed open/dump_tfidf.py is the script needed to do this. I am also assuming that we need to pass --sparse to open/run_pred.py to use the sparse index. But I am not sure which argument to use to pass in the generated hdf5 file to this script.
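For context, this is roughly how I have been inspecting the dumped file so far (just generic h5py usage; the path is a placeholder, and the group/dataset layout of the dump is exactly what I'm unsure about):

```python
# Generic h5py inspection of the dumped tfidf file. The path is a
# placeholder; the expected group/dataset layout is what I'm asking about.
import h5py

with h5py.File('path/to/tfidf_dump.hdf5', 'r') as f:
    f.visit(print)  # print the name of every group/dataset in the file
```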
Also, what confused me is that open/run_pred.py still seems to require the Wikipedia tfidf dump from DrQA (as --ranker_path). What is this used for? The doc ids here may not correspond to my corpus anymore, so will that create a problem? E.g. here: https://github.com/uwnlp/denspi/blob/master/open/mips_sparse.py#L181
I would greatly appreciate some guidance on how to run the dense + sparse index for a custom corpus.
Thank you, Bhuwan
Hi Bhuwan,
sorry for the inconvenience. Running open/dump_tfidf.py outputs paragraph-level tfidf for your corpus, which should be located under the args.dump_dir/tfidf folder. Note that this script uses [PAR] to split a document into paragraphs.
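Roughly, that splitting step looks like the following (a minimal sketch; split_document is an illustrative name, not the actual function in open/dump_tfidf.py):

```python
# Minimal sketch of the [PAR]-based paragraph splitting described above.
# `split_document` is an illustrative helper name, not the exact code in
# open/dump_tfidf.py.

def split_document(doc_text):
    """Split a raw document string into non-empty paragraphs on [PAR]."""
    paragraphs = [p.strip() for p in doc_text.split("[PAR]")]
    return [p for p in paragraphs if p]

print(split_document("First paragraph.[PAR]Second paragraph.[PAR] "))
# -> ['First paragraph.', 'Second paragraph.']
```

So make sure your input documents use [PAR] as the paragraph delimiter before running the dump.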
Also, the reason we need DrQA is to compute document-level tfidf, as it has the inverted index of the whole Wikipedia corpus. If you want to use a subset of Wikipedia for running DenSPI, you have to modify the code to map your documents to the original indices in the DrQA Wikipedia corpus. And yes, a custom corpus (i.e., not Wikipedia) will create a problem in this version. You can simply remove the document-level tfidf, but that will give you a noticeable decrease in performance (especially for QA pairs where document selection matters, e.g., SQuAD-open). For custom document-level tfidf generation, see here: https://github.com/facebookresearch/DrQA/blob/master/scripts/retriever/build_tfidf.py.
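If you go the build_tfidf.py route, loading and querying the resulting ranker looks roughly like this (a sketch assuming DrQA is installed; the .npz filename is a placeholder following build_tfidf.py's naming scheme):

```python
# Sketch: load a custom document-level tfidf built with DrQA's
# scripts/retriever/build_tfidf.py and score documents for a query.
# The .npz filename below is a placeholder, not a file shipped with DenSPI.
from drqa import retriever

ranker = retriever.get_class('tfidf')(
    tfidf_path='my_corpus-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz'
)

# closest_docs returns the top-k document ids and their tfidf scores
doc_names, doc_scores = ranker.closest_docs('sample question text', k=5)
for name, score in zip(doc_names, doc_scores):
    print(name, score)
```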
We are working on refactoring and providing cleaner code for custom corpora. It will take a few more weeks. Thanks.
Jinhyuk
Thanks for the quick response Jinhyuk!
So to confirm my understanding: the order of documents in self.ranker.doc_mat here should match the order in the predict file used for generating the phrase vectors passed to run_piqa.py? (Since doc_idx seems to be inferred using an enumerate over the input docs here?)
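To illustrate the concern (sklearn here is just a stand-in for the DrQA tfidf matrix; variable names are mine, not DenSPI's):

```python
# Toy illustration of the ordering requirement: if doc_idx comes from
# enumerating the input docs, then row i of the tfidf matrix must
# correspond to docs[i]. sklearn stands in for DrQA's tfidf here.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "dogs bark loudly", "cats and dogs"]  # predict-file order
doc_mat = TfidfVectorizer().fit_transform(docs)  # row i corresponds to docs[i]

for doc_idx, doc in enumerate(docs):
    row = doc_mat[doc_idx]  # only valid if both orderings are identical
    print(doc_idx, repr(doc), "nonzero tfidf terms:", row.nnz)
```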
Hi @Arjunsankarlal and @bdhingra, I just updated the code and readme so that they now support running a demo with a custom phrase index. Please try https://github.com/uwnlp/denspi#train and https://github.com/uwnlp/denspi#create-a-custom-phrase-index. You will be able to train with your own SQuAD-like data and host a demo with your custom document files as well.
Scaling up is detailed in https://github.com/uwnlp/denspi#create-a-large-phrase-index
It's still missing some details, which will be added soon. Thanks!
Hi, I believe you mean creating your own index for an arbitrary text corpus. The code is there but lacks documentation/refactoring. Working on it, please stay tuned!