google-research-datasets / natural-questions

Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
Apache License 2.0

Convert token to text #5

Open vishalkatiyar007 opened 5 years ago

vishalkatiyar007 commented 5 years ago

Is there a way to convert the output (currently in the form of tokens) of the model to text for easy interpretation and testing?

vishalkatiyar007 commented 5 years ago

For example, the annotator marks the long answer using byte offsets, token offsets, and an index into the list of long answer candidates: `"long_answer": { "start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0 }`. How do I map these byte and token offsets back to the text containing the answer?

filbertphang commented 5 years ago

You might want to try something like this:

```python
import jsonlines

INPUT_FILE = "nq-train-sample.jsonl"
START_TOKEN = 3521
END_TOKEN = 3525
QAS_ID = 4549465242785278785
REMOVE_HTML = True

def get_span_from_token_offsets(f, start_token, end_token, qas_id,
                                remove_html):
    """Return the answer span text for one example, or None if the id
    is not found. Token offsets are half-open: [start_token, end_token)."""
    for obj in f:
        if obj["example_id"] != qas_id:
            continue

        # Optionally drop HTML tokens (e.g. "<P>", "</Table>") so the
        # span reads as plain text.
        answer_span = [
            item["token"]
            for item in obj["document_tokens"][start_token:end_token]
            if not (remove_html and item["html_token"])
        ]
        return " ".join(answer_span)

    return None

with jsonlines.open(INPUT_FILE) as f:
    result = get_span_from_token_offsets(f, START_TOKEN, END_TOKEN, QAS_ID,
                                         REMOVE_HTML)

print(result)
```
Output: `March 18 , 2018`
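Since the question also mentions byte offsets: those index into the UTF-8 encoding of the raw document HTML, so you can slice the encoded bytes directly. A minimal sketch, assuming the (non-simplified) NQ format where each example carries a `document_html` field; the helper name is my own:

```python
def get_span_from_byte_offsets(f, start_byte, end_byte, qas_id):
    """Return the answer span by slicing the document HTML's UTF-8 bytes.

    `f` is any iterable of parsed NQ example dicts (e.g. a jsonlines reader).
    Byte offsets are half-open: [start_byte, end_byte).
    """
    for obj in f:
        if obj["example_id"] != qas_id:
            continue
        # Offsets refer to bytes, not characters, so encode before slicing.
        html_bytes = obj["document_html"].encode("utf-8")
        return html_bytes[start_byte:end_byte].decode("utf-8")
    return None
```

Note the result still contains HTML markup, unlike the token-based version with `REMOVE_HTML = True`.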

You can read your prediction file to get the various `start_token`s, `end_token`s, and `example_id`s, then call the function for each prediction to build a list of answer spans (and write them to a file, or whatever you need).
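That loop could look something like this. A sketch only: the shape of `predictions` (a list of dicts keyed by `example_id`, `start_token`, `end_token`) is an assumption about your prediction file, and `spans_from_predictions` is a hypothetical helper. Indexing the examples by id first avoids re-scanning the dataset for every prediction:

```python
def spans_from_predictions(nq_examples, predictions):
    """Map each prediction to its answer text.

    `nq_examples` is an iterable of parsed NQ example dicts;
    `predictions` is assumed to be a list of dicts with keys
    "example_id", "start_token", and "end_token".
    """
    # Build an id -> example index so each prediction is a dict lookup.
    examples = {obj["example_id"]: obj for obj in nq_examples}
    spans = []
    for pred in predictions:
        obj = examples[pred["example_id"]]
        tokens = [
            t["token"]
            for t in obj["document_tokens"][pred["start_token"]:pred["end_token"]]
            if not t["html_token"]  # skip markup tokens
        ]
        spans.append(" ".join(tokens))
    return spans
```

Beware that building the index holds all examples in memory; for a full NQ shard you may prefer a single pass that collects only the `example_id`s present in your predictions.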

hope this helps!