shon-otmazgin / fastcoref

MIT License

Feature/allow tokenized texts #15

Closed shon-otmazgin closed 1 year ago

shon-otmazgin commented 1 year ago

The predict function now accepts the following inputs:

texts (str, List[str], List[List[str]]) — the sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (a pretokenized sequence). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of raw strings).

is_split_into_words (bool) — indicates whether the texts input is already tokenized.
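The ambiguity mentioned above comes from the fact that a List[str] could be either a batch of raw strings or a single pretokenized sequence. A minimal sketch of that disambiguation (a hypothetical helper, not fastcoref's actual code):

```python
# Hypothetical helper illustrating the input-shape disambiguation:
# a List[str] is ambiguous between a batch of raw strings and one
# pretokenized sequence, so is_split_into_words decides the reading.
def as_batches(texts, is_split_into_words=False):
    if isinstance(texts, str):
        return [texts]               # single raw string -> batch of one
    if texts and isinstance(texts[0], str):
        if is_split_into_words:
            return [list(texts)]     # one pretokenized sequence
        return list(texts)           # batch of raw strings
    return [list(t) for t in texts]  # List[List[str]]: batch of pretokenized sequences
```

For example, as_batches(["Hello", "world"], is_split_into_words=True) yields a batch containing one two-token sequence, while the same list without the flag is treated as two separate texts.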

Usage:

texts = [["We", "are", "so", "happy", "to", "see", "you", "using", "our", "coref", "package", ".", "This", "package", "is", "very", "fast", "!"],
         ["The", "man", "tried", "to", "put", "the", "boot", "on", "his", "foot", "but", "it", "was", "too", "small", "."],
         ["I", "have", "a", "dog", ".", "The", "dog", "'s", "toys", "are", "really", "cool", "."]]

from fastcoref import FCoref

model = FCoref(device='cpu')
preds = model.predict(texts, is_split_into_words=True)
for p in preds:
    print(p.get_clusters())
>
[[['We'], ['our']], [['our', 'coref', 'package'], ['This', 'package']]]
[[['The', 'man'], ['his']], [['his', 'foot'], ['it']]]
[[['a', 'dog'], ['The', 'dog', "'s"]]]

Note: no changes to CorefResult were needed. The pretokenized case is handled with:

    if nlp is not None:
        # raw text input: tokenize with spaCy to get character offsets
        tokenized_texts = tokenize_with_spacy(batch['text'], nlp)
    else:
        # pretokenized input: build a synthetic word-level offset mapping,
        # pairing token i with the span (i, i + 1)
        tokenized_texts = batch
        tokenized_texts['offset_mapping'] = [list(zip(range(len(tokens)), range(1, 1 + len(tokens))))
                                             for tokens in tokenized_texts['tokens']]
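To make the synthetic offset mapping concrete, here is the same zip expression pulled out into a standalone sketch (the helper name is illustrative, not part of the library): each token i is mapped to the word-level span (i, i + 1), standing in for the character offsets spaCy would otherwise provide.

```python
# Illustrative helper: word-level offset mapping for a pretokenized sequence.
# Mirrors the zip expression in the snippet above.
def word_level_offsets(tokens):
    return list(zip(range(len(tokens)), range(1, 1 + len(tokens))))

print(word_level_offsets(["I", "have", "a", "dog", "."]))
# [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
```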