Input prompt does not match output prompt when input contractions are already tokenized (ex. `mother 's`)

hltcoe / sandle

Run a large language modeling SANDbox in your Local Environment


Input prompt does not match output prompt when input contractions are already tokenized (ex. `mother 's`) #59

Closed ccmaymay closed 2 years ago

ccmaymay commented 2 years ago

From @nweir127:

The exception for mismatched input/output sequences is broken when the input text has any whitespace before apostrophes (`mother's` vs. `mother 's`). The `tokenizer.decode()` call needs to set `clean_up_tokenization_spaces` to false; otherwise it removes that whitespace and the sequences won't align.
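To make the misalignment concrete, here is a minimal sketch that does not call the real tokenizer. `fake_decode` and `output_starts_with_prompt` are hypothetical names, and `CLEANUP_RULES` is only an approximation of the string replacements such a cleanup step typically performs; the actual check in sandle may look different.

```python
# Hypothetical stand-in for the cleanup that decode() applies when
# clean_up_tokenization_spaces is true. The rule list is an assumption
# made for illustration, not the library's code.
CLEANUP_RULES = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
    (" 've", "'ve"), (" 're", "'re"),
]


def fake_decode(text, clean_up_tokenization_spaces=True):
    """Pretend encode/decode round trip: lossless except for optional cleanup."""
    if clean_up_tokenization_spaces:
        for old, new in CLEANUP_RULES:
            text = text.replace(old, new)
    return text


def output_starts_with_prompt(prompt, decoded):
    """The kind of alignment check that fails in this issue (assumed form)."""
    return decoded.startswith(prompt)


prompt = "my mother 's house"
model_text = prompt + " is old"  # prompt plus a made-up completion

cleaned = fake_decode(model_text, clean_up_tokenization_spaces=True)
raw = fake_decode(model_text, clean_up_tokenization_spaces=False)

print(output_starts_with_prompt(prompt, cleaned))  # cleanup removed the space
print(output_starts_with_prompt(prompt, raw))      # whitespace preserved
```

With cleanup enabled, the decoded text no longer begins with the original prompt, so any prefix-based comparison reports a mismatch even though the model echoed the prompt faithfully.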

Would be a good bug to use to kick off regression testing.

jackjyzhang commented 2 years ago

I'm using a simple fix, `text = tokenizer.decode(tokenizer.encode(text))`, which changes the text in `_complete` before feeding it to the model. This might have other unintended consequences, but it seems to work fine on my side for now.
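The workaround above can be sketched as follows. `round_trip` is a hypothetical stand-in for `tokenizer.decode(tokenizer.encode(text))` that models only the contraction cleanup, and `complete` is a toy model call, so treat this as an illustration of the idea rather than the code in `_complete`.

```python
def round_trip(text):
    """Hypothetical stand-in for tokenizer.decode(tokenizer.encode(text)).

    Only the " 's" cleanup is modeled here; that is an assumption made to
    keep the example small.
    """
    return text.replace(" 's", "'s")


def complete(prompt):
    """Toy model call: echoes the prompt it was given plus a fixed completion,
    then decodes with the same cleanup applied."""
    return round_trip(prompt + " is old")


original = "my mother 's house"
normalized = round_trip(original)  # normalize before feeding the model

output = complete(normalized)

print(output.startswith(normalized))  # the echoed prompt now aligns
print(normalized == original)         # but the caller's text was altered
```

The sequences align because the prompt is cleaned up the same way the output is, at the cost of silently changing the text the caller supplied. That is one plausible form of the unintended consequences mentioned above.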

ccmaymay commented 2 years ago

@nweir127 I made the fix you suggested and this issue appears to be resolved, thank you. Are there any unintended consequences of turning off `clean_up_tokenization_spaces`? I haven't seen anything so far, but I'm wondering if I'm just not looking in the right places.