seominjoon / denspi

Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DenSPI)
https://nlp.cs.washington.edu/denspi
Apache License 2.0

How to generate dense vector and sparse vector for own data #3

Closed · Arjunsankarlal closed this issue 5 years ago

seominjoon commented 5 years ago

Hi, I believe you mean creating your own index for an arbitrary text corpus. The code is there but lacks documentation/refactoring. Working on it, please stay tuned!

Arjunsankarlal commented 5 years ago

Hi @seominjoon, thanks for the response. Yes, that is exactly what I am looking for. Could you help me by pointing out where exactly I should look? That would be very helpful. Thanks in advance :)

jhyuklee commented 5 years ago

Hi @Arjunsankarlal, the code for indexing starts here: https://github.com/uwnlp/denspi/blob/11ff5f8d31390384c8346e82f764c3b3c4e5b819/run_piqa.py#L655 Thanks!

bdhingra commented 5 years ago

Hi, is there any update on this?

I was trying to generate the sparse index for my own corpus. I assumed open/dump_tfidf.py is the script needed to do this. I am also assuming that we need to pass --sparse to open/run_pred.py to use the sparse index. But I am not sure which argument to use to pass the generated HDF5 file to this script.

Also, what confused me is that open/run_pred.py still seems to require the Wikipedia TF-IDF dump from DrQA (as --ranker_path). What is this used for? The doc ids here may no longer correspond to my corpus, so will that create a problem? E.g. here: https://github.com/uwnlp/denspi/blob/master/open/mips_sparse.py#L181

I would greatly appreciate some guidance on how to run the dense + sparse index for a custom corpus.

Thank you, Bhuwan

jhyuklee commented 5 years ago

Hi Bhuwan,

Sorry for the inconvenience. Running open/dump_tfidf.py outputs paragraph-level TF-IDF vectors for your corpus, which should be located under the args.dump_dir/tfidf folder. Note that this script uses [PAR] to split a document into paragraphs.
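
For illustration, the splitting convention looks roughly like this (a minimal sketch, not the script's actual code; only the [PAR] marker is taken from it):

```python
# Minimal sketch of the paragraph-splitting convention in open/dump_tfidf.py:
# a document carries literal "[PAR]" markers between paragraphs, and TF-IDF
# vectors are then computed per paragraph. Variable names are illustrative.
doc_text = "First paragraph text. [PAR] Second paragraph text. [PAR] Third paragraph text."
paragraphs = [p.strip() for p in doc_text.split("[PAR]") if p.strip()]
# -> ["First paragraph text.", "Second paragraph text.", "Third paragraph text."]
```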

Also, the reason why we need DrQA is to compute document-level TF-IDF, as it has the inverted index of the whole Wikipedia corpus. If you want to use a subset of Wikipedia for running DenSPI, you have to modify the code to map your documents to the original indices in the DrQA Wikipedia corpus. And yes, it will create a problem if you use a custom corpus (not Wikipedia) in this version. You can simply remove the document-level TF-IDF, but it will give you a noticeable decrease in performance (especially for QA pairs where document selection matters, e.g., SQuAD-open). For custom document-level TF-IDF generation, see here: https://github.com/facebookresearch/DrQA/blob/master/scripts/retriever/build_tfidf.py
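
Once you have built a TF-IDF model over your corpus with that script, loading it follows DrQA's retriever API. A hedged sketch (the .npz path is a placeholder; double-check the calls against your installed DrQA version):

```python
# Sketch: loading a document-level TF-IDF ranker built with DrQA's
# scripts/retriever/build_tfidf.py over a custom corpus.
from drqa import retriever

ranker = retriever.get_class('tfidf')(tfidf_path='/path/to/custom-corpus-tfidf.npz')
doc_names, doc_scores = ranker.closest_docs('example question', k=5)
print(list(zip(doc_names, doc_scores)))
```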

We are refactoring to provide cleaner code for custom corpora; it should take a few more weeks. Thanks.

Jinhyuk

bdhingra commented 5 years ago

Thanks for the quick response Jinhyuk!

So to confirm my understanding: the order of documents in self.ranker.doc_mat here should match the order in the predict file used for generating the phrase vectors passed to run_piqa.py? (Since the doc_idx seems to be inferred using an enumerate over the input docs here?)

jhyuklee commented 5 years ago

Yes, you are correct. See here, where 'doc_idx' is used as the key of the HDF5 files, and here, where 'doc_idx' is used to get document scores calculated from self.ranker.doc_mat.
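
In other words, the invariant is roughly the following (a minimal sketch with a hypothetical file path and a stand-in matrix, assuming the HDF5 keys were written as str(doc_idx)):

```python
import h5py
from scipy.sparse import random as sparse_random

# Hypothetical stand-in for self.ranker.doc_mat: a TF-IDF matrix whose columns
# are documents in the same order as the predict file (orientation illustrative).
doc_mat = sparse_random(1024, 100, density=0.01, format='csc')

with h5py.File('/path/to/phrase_dump.hdf5', 'r') as phrase_dump:
    for key in phrase_dump:
        # Keys were written as str(doc_idx) at dump time.
        doc_idx = int(key)
        # Column doc_idx must describe the same document as phrase_dump[key];
        # if the orderings diverge, document-level scores get attached to the
        # wrong documents' phrase vectors.
        doc_col = doc_mat[:, doc_idx]
```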

seominjoon commented 5 years ago

Hi @Arjunsankarlal and @bdhingra, I just updated the code and readme so that they now support running a demo with a custom phrase index. Please try https://github.com/uwnlp/denspi#train and https://github.com/uwnlp/denspi#create-a-custom-phrase-index. You will be able to train with your own SQuAD-like data and host a demo with your custom document files as well.

Scaling up is detailed in https://github.com/uwnlp/denspi#create-a-large-phrase-index

It's still missing some details, which will be added soon. Thanks!