google-research-datasets / natural-questions

Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
Apache License 2.0

Convert token to text #5

Open vishalkatiyar007 opened 5 years ago

vishalkatiyar007 commented 5 years ago

Is there a way to convert the output (currently in the form of tokens) of the model to text for easy interpretation and testing?

vishalkatiyar007 commented 5 years ago

For example, the annotator marks the long answer using byte offsets, token offsets, and an index into the list of long answer candidates: `"long_answer": { "start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0 }`. How do I map these byte and token offsets back to the text containing the answer?

filbertphang commented 5 years ago

You might want to try something like this:

```python
import jsonlines

INPUT_FILE = "nq-train-sample.jsonl"
START_TOKEN = 3521
END_TOKEN = 3525
QAS_ID = 4549465242785278785
REMOVE_HTML = True

def get_span_from_token_offsets(f, start_token, end_token, qas_id,
                                remove_html):
    """Return the answer span text for one example, or None if the id
    is not found. Token offsets are half-open: [start_token, end_token)."""
    for obj in f:
        if obj["example_id"] != qas_id:
            continue

        # Optionally drop HTML tokens (e.g. "<P>", "</Table>") so the
        # span reads as plain text.
        answer_span = [
            item["token"]
            for item in obj["document_tokens"][start_token:end_token]
            if not (remove_html and item["html_token"])
        ]
        return " ".join(answer_span)

    return None

with jsonlines.open(INPUT_FILE) as f:
    result = get_span_from_token_offsets(f, START_TOKEN, END_TOKEN, QAS_ID,
                                         REMOVE_HTML)

print(result)
```
Output: `March 18 , 2018`
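Since the question also mentions byte offsets: those index into the UTF-8 encoding of the raw document HTML, so you can slice the encoded bytes directly. A minimal sketch, assuming the (non-simplified) NQ format where each example carries a `document_html` field; the helper name is my own:

```python
def get_span_from_byte_offsets(f, start_byte, end_byte, qas_id):
    """Return the answer span by slicing the document HTML's UTF-8 bytes.

    `f` is any iterable of parsed NQ example dicts (e.g. a jsonlines reader).
    Byte offsets are half-open: [start_byte, end_byte).
    """
    for obj in f:
        if obj["example_id"] != qas_id:
            continue
        # Offsets refer to bytes, not characters, so encode before slicing.
        html_bytes = obj["document_html"].encode("utf-8")
        return html_bytes[start_byte:end_byte].decode("utf-8")
    return None
```

Note the result still contains HTML markup, unlike the token-based version with `REMOVE_HTML = True`.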

You can read your prediction file to get the various `start_token`s, `end_token`s, and `example_id`s, then call the function for each prediction to build a list of answer spans (and write them to a file, or whatever you need).
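That loop could look something like this. A sketch only: the shape of `predictions` (a list of dicts keyed by `example_id`, `start_token`, `end_token`) is an assumption about your prediction file, and `spans_from_predictions` is a hypothetical helper. Indexing the examples by id first avoids re-scanning the dataset for every prediction:

```python
def spans_from_predictions(nq_examples, predictions):
    """Map each prediction to its answer text.

    `nq_examples` is an iterable of parsed NQ example dicts;
    `predictions` is assumed to be a list of dicts with keys
    "example_id", "start_token", and "end_token".
    """
    # Build an id -> example index so each prediction is a dict lookup.
    examples = {obj["example_id"]: obj for obj in nq_examples}
    spans = []
    for pred in predictions:
        obj = examples[pred["example_id"]]
        tokens = [
            t["token"]
            for t in obj["document_tokens"][pred["start_token"]:pred["end_token"]]
            if not t["html_token"]  # skip markup tokens
        ]
        spans.append(" ".join(tokens))
    return spans
```

Beware that building the index holds all examples in memory; for a full NQ shard you may prefer a single pass that collects only the `example_id`s present in your predictions.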

hope this helps!