loubnabnl closed this issue 2 years ago
Hi @loubnabnl, thank you for your interest in my work. To add noise to your texts, go through each token in your dataset and, with a chosen probability p (e.g. p = 20%), apply the following method:
def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    # Every way to cut the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]  # drop one character
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]  # swap two adjacent characters
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]  # substitute one character
    inserts = [L + c + R for L, R in splits for c in letters]  # add one character
    return set(deletes + transposes + replaces + inserts)
This will return a set of misspelled candidates. Then you can just randomly choose one candidate from this set, use it instead of the original token and move on to the next token.
This code has been taken from: https://norvig.com/spell-correct.html
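For reference, the per-token procedure described above could be sketched roughly as follows. The `add_noise` helper, its parameters, and the whitespace tokenization are assumptions made for illustration; they are not part of the original code. The sketch also drops the original word and the empty string from the candidates, because `edits1` can return the unchanged word (a replace with the same letter) and turns a one-letter token into an empty string on deletion.

```python
import random


def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def add_noise(text, p=0.2, rng=None):
    """Hypothetical helper: corrupt each whitespace-separated token with probability p."""
    rng = rng or random.Random()
    noisy_tokens = []
    for token in text.split():  # assumption: simple whitespace tokenization
        if rng.random() < p:
            # Exclude the unchanged word and the empty string (illustrative choice).
            # Sorting makes the pick reproducible for a seeded rng.
            candidates = sorted(edits1(token) - {token, ""})
            token = rng.choice(candidates)
        noisy_tokens.append(token)
    return " ".join(noisy_tokens)


if __name__ == "__main__":
    print(add_noise("the patient has a fever", p=0.2, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the noise reproducible, which helps when the same noisy split has to be regenerated later. Note that `edits1` only inserts and substitutes lowercase ASCII letters, so casing and non-English characters would need separate handling.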
Thanks a lot!
Hello, thank you for your work. Could you please provide the code to create noisy text from the MEDNLI dataset?
Thank you in advance.