helboukkouri / character-bert

Main repository for "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters"
Apache License 2.0
195 stars 47 forks

MEDNLI noisy text #19

Closed loubnabnl closed 2 years ago

loubnabnl commented 2 years ago

Hello, thank you for your work. Could you please provide the code used to create noisy text from the MEDNLI dataset?

Thank you in advance.

helboukkouri commented 2 years ago

Hi @loubnabnl, thank you for your interest in my work. To add noise to your texts, you can go through each token in your dataset and, with a chosen probability p (e.g. p = 20%), apply the following function to it:

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

This returns a set of misspelled candidates. You can then pick one candidate from this set at random, use it in place of the original token, and move on to the next token.

This code is taken from Peter Norvig's spelling corrector: https://norvig.com/spell-correct.html
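For illustration, the per-token procedure described above could be wired together roughly as follows. This is only a sketch, not the exact script used for the paper: the `add_noise` name, the `rng` parameter, and whitespace tokenization are assumptions made here. Note that `edits1` can return the original word (a replacement by the same letter) and, for one-letter words, the empty string, so the sketch filters both out before sampling; that filtering is also an assumption.

```python
import random


def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def add_noise(tokens, p=0.2, rng=None):
    """Return a copy of `tokens` where each token is misspelled with probability p.

    `add_noise` and `rng` are hypothetical names introduced for this sketch.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for token in tokens:
        if rng.random() < p:
            # Assumption: drop the unchanged word and the empty string so the
            # chosen candidate is always a real, non-empty misspelling.
            # Sorting makes the choice reproducible for a seeded `rng`.
            candidates = sorted(c for c in edits1(token) if c and c != token)
            if candidates:
                token = rng.choice(candidates)
        noisy.append(token)
    return noisy


if __name__ == "__main__":
    # Whitespace tokenization is an assumption; use your dataset's tokenizer.
    sentence = "the patient has a fever"
    print(" ".join(add_noise(sentence.split(), p=0.2, rng=random.Random(0))))
```

Since `letters` only contains lowercase ASCII characters, inserted and replaced characters are always lowercase, whatever the casing of the original token.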

loubnabnl commented 2 years ago

Thanks a lot!