noseworm / convai


[New Model] Wikipedia fact extractor #8

Closed koustuvsinha closed 6 years ago

koustuvsinha commented 6 years ago

Extract facts from Wikipedia using Elasticsearch api

Implementation ideas:

The Wikipedia dump can be found here. However, given the huge size of this dump, I don't know whether it would be feasible to wrap it inside our Docker container. An easier way to tackle the problem would be to use the Python `wikipedia` API, but per the ConvAI rules we are not allowed to make external API calls.
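A minimal sketch of the Elasticsearch approach, assuming the dump has already been indexed into a local index named `wikipedia` with the article body in a `text` field (both names are hypothetical, as is the query shape below; the actual mapping would depend on how the dump is loaded):

```python
def build_fact_query(entity, size=1):
    """Build an Elasticsearch match query for a one-line fact about `entity`.

    Returns a plain dict so it can be inspected or serialized before sending.
    """
    return {
        "size": size,  # how many candidate passages to return
        "query": {"match": {"text": entity}},
    }


if __name__ == "__main__":
    # With the `elasticsearch` Python client installed, the query would be
    # issued against a local node along these lines (not run here):
    #   from elasticsearch import Elasticsearch
    #   es = Elasticsearch("http://localhost:9200")
    #   hits = es.search(index="wikipedia", body=build_fact_query("Alan Turing"))
    print(build_fact_query("Alan Turing"))
```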

koustuvsinha commented 6 years ago

Verified with the organizers: we cannot use an external API.

koustuvsinha commented 6 years ago

Update:

After installing Elasticsearch and loading the Wikipedia dump, the Docker image is now so big that on `docker commit` I get a "no space left on device" error. Probably the /var partition on my server doesn't have enough space for this huge model (150GB+). Possible workaround: download the indices after the Docker container has been initialized on the client side.
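The workaround could look something like the sketch below: keep the image small and pull the pre-built indices on first startup. The URL and paths are placeholders, not real artifacts from this project.

```python
import os
import tarfile
import urllib.request


def ensure_indices(url, dest_dir):
    """Download and extract the index archive unless it is already present.

    Returns True if a download happened, False if the indices were found.
    """
    if os.path.isdir(dest_dir):
        return False  # already fetched on a previous container start
    archive, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(os.path.dirname(dest_dir) or ".")
    return True


# Hypothetical startup call, before launching Elasticsearch:
#   ensure_indices("https://example.com/wiki-indices.tar.gz", "/data/es-indices")
```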

koustuvsinha commented 6 years ago

While zipping the indices I ran out of disk space again 😢

Breakend commented 6 years ago

Should we use this? https://github.com/facebookresearch/DrQA

that way we can take a subset of the Wiki corpus or other QA corpora to ask questions about

Breakend commented 6 years ago

It's only 25 GB, and we can train the model to retrieve the answers.

koustuvsinha commented 6 years ago

oh WOW!!!!! This makes life so easy!!

koustuvsinha commented 6 years ago

Ok, I am not so sure about DrQA's performance now. Our initial understanding was that if a user asks a question, we extract the entity and search the Wiki dump to get a one-liner. This is what I get after a few iterations:

[screenshots of DrQA output, 2017-11-02, 11:02 PM / 11:08 PM / 11:13 PM]
Breakend commented 6 years ago

wait, what do you mean after a few iterations? Like epochs?

koustuvsinha commented 6 years ago

^ No, not epochs, this was a pretrained model. By iterations I meant the number of times I tested it 😛

Breakend commented 6 years ago

Maybe we can try increasing the top-X parameters: https://github.com/facebookresearch/DrQA/blob/master/scripts/pipeline/interactive.py

Breakend commented 6 years ago

especially the n-docs
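A sketch of what widening the retrieval might look like, mirroring the top-n / n-docs knobs exposed in `scripts/pipeline/interactive.py`. The DrQA call itself is left commented out since it needs the library and model files installed, and the import path and constructor arguments are assumptions:

```python
def retrieval_kwargs(top_n=3, n_docs=30):
    """More candidate docs/answers gives the reader a better shot at the fact."""
    return {"top_n": top_n, "n_docs": n_docs}


# Assumed usage (requires DrQA checked out and models downloaded):
#   from drqa import pipeline
#   drqa = pipeline.DrQA(cuda=False)
#   preds = drqa.process("Who wrote Hamlet?", **retrieval_kwargs(n_docs=50))
```

The trade-off is the one raised below: a larger `n_docs` improves recall but makes each query slower.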

koustuvsinha commented 6 years ago

Well, we could include this model as-is since it is sometimes good at fetching the correct answer, but the one problem is that the processing time is quite long. We could do one thing: first assess whether the question is related to the document (entity overlap), and if so use vanilla DrQA to answer it. If not, we could send a request to this model to generate the answer and, later in the conversation, send the response to the user, like "Btw, you asked about blah, I think the answer is blah."
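The routing idea above could be sketched like this; the function names and the string-containment overlap check are hypothetical simplifications of whatever entity matcher the bot actually uses:

```python
def entity_overlap(question, doc_entities):
    """True if any known document entity appears in the question text."""
    q = question.lower()
    return any(entity.lower() in q for entity in doc_entities)


def route(question, doc_entities):
    """Pick the fast synchronous path or the slow deferred path."""
    if entity_overlap(question, doc_entities):
        return "drqa"      # fast path: vanilla DrQA over the current document
    return "deferred"      # slow path: answer later ("Btw, you asked about ...")
```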

NicolasAG commented 6 years ago

Closing with commit https://github.com/mike-n-7/convai/commit/59739d4eb76877ee75a13862b5e3b81feb06566e