vishalkatiyar007 opened this issue 5 years ago
For example, the annotator marks the long answer using byte offsets, token offsets, and an index into the list of long answer candidates: "long_answer": { "start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0 }. How do I map these byte and token offsets back to the text that contains the answer?
you might want to try something like this
import jsonlines

INPUT_FILE = "nq-train-sample.jsonl"
START_TOKEN = 3521
END_TOKEN = 3525
QAS_ID = 4549465242785278785
REMOVE_HTML = True

def get_span_from_token_offsets(f, start_token, end_token, qas_id,
                                remove_html):
    for obj in f:
        # Skip records until we reach the requested example.
        if obj["example_id"] != qas_id:
            continue
        if remove_html:
            # Keep only real text tokens, dropping HTML tokens such as <P> or <Table>.
            answer_span = [
                item["token"]
                for item in obj["document_tokens"][start_token:end_token]
                if not item["html_token"]
            ]
        else:
            answer_span = [
                item["token"]
                for item in obj["document_tokens"][start_token:end_token]
            ]
        return " ".join(answer_span)

with jsonlines.open(INPUT_FILE) as f:
    result = get_span_from_token_offsets(f, START_TOKEN, END_TOKEN, QAS_ID,
                                         REMOVE_HTML)

print(result)
Output: March 18 , 2018
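If you want to go via the byte offsets instead (or as a sanity check), here is a minimal sketch along the same lines, reusing INPUT_FILE and QAS_ID from above. It assumes the full (non-simplified) training format, where each record carries the raw page in a document_html field and the offsets are UTF-8 byte positions into it; note the slice will still contain any HTML tags:

import jsonlines

START_BYTE = 32    # hypothetical values; take them from your example's
END_BYTE = 106     # "long_answer" entry

def get_span_from_byte_offsets(f, start_byte, end_byte, qas_id):
    for obj in f:
        if obj["example_id"] != qas_id:
            continue
        # The offsets index into the UTF-8 bytes of the raw HTML, so encode first.
        html_bytes = obj["document_html"].encode("utf-8")
        return html_bytes[start_byte:end_byte].decode("utf-8")

with jsonlines.open(INPUT_FILE) as f:
    print(get_span_from_byte_offsets(f, START_BYTE, END_BYTE, QAS_ID))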
you can read your prediction file to get the various start_tokens, end_tokens, and example_ids, then call the function for each one to build a list of prediction spans (and write them to a file or whatever you need)
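A rough sketch of that loop, assuming your prediction file follows the official NQ prediction format (a top-level "predictions" list whose entries carry "example_id" and "long_answer" token offsets); adjust the field names and path if yours differs:

import json
import jsonlines

PRED_FILE = "predictions.json"   # hypothetical path to your prediction file

with open(PRED_FILE) as pf:
    predictions = json.load(pf)["predictions"]

spans = []
for pred in predictions:
    la = pred["long_answer"]
    if la["start_token"] < 0:
        # Offsets of -1 mean the model predicted no long answer for this example.
        continue
    # Re-opens the data file for every lookup; build an example_id -> record
    # index up front instead if this is too slow for your data.
    with jsonlines.open(INPUT_FILE) as f:
        spans.append(get_span_from_token_offsets(
            f, la["start_token"], la["end_token"], pred["example_id"],
            REMOVE_HTML))

print(spans)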
hope this helps!
Is there a way to convert the model's output (currently in the form of tokens) to text for easier interpretation and testing?