Closed by jconn0 5 months ago
Text preprocessing makes the input text given to the model more consistent.
Some methods include:
Normalization - converting text to lowercase, standardizing expressions, ... Helps the model treat words the same regardless of case.
Lemmatization - converting words to their base form. Helps the model recognize similarities between words in different forms.
Text cleanup - removing non-words, URLs, digits, ... Helps the model by removing elements that are not relevant to understanding the text.
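As a rough illustration of the normalization and cleanup steps listed above, here is a minimal sketch using only the Python standard library. The function names (`normalize`, `clean`, `preprocess`) and the regular expressions are assumptions made for this example, not code from the project. It assumes English ASCII text, so it will also split contractions such as "don't". Lemmatization is left out because it usually needs an external library such as NLTK or spaCy.

```python
import re
import unicodedata

# Hypothetical patterns for this sketch; tune them for the real corpus.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Applied after lowercasing: anything that is not a-z or whitespace.
NON_WORD_RE = re.compile(r"[^a-z\s]")


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms (NFKC) and convert to lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def clean(text: str) -> str:
    """Remove URLs, digits, and other non-letter characters."""
    text = URL_RE.sub(" ", text)       # drop URLs first so their parts do not survive
    text = NON_WORD_RE.sub(" ", text)  # drop digits, punctuation, symbols
    return " ".join(text.split())      # collapse repeated whitespace


def preprocess(text: str) -> str:
    """Run normalization, then cleanup."""
    return clean(normalize(text))


if __name__ == "__main__":
    print(preprocess("Visit https://example.com NOW, 3 times!"))
```

URLs are removed before the non-letter filter runs; otherwise fragments such as "https" and "example" would be left behind as ordinary words.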
Added: Remove hyphenation introduced by line breaks. Remove asterisks, dots at the end of a line, and footnote markers in the format [number].
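The added cleanup rules could look roughly like the sketch below. The function name `clean_lines` is hypothetical, and "dots at end of line" is interpreted here as any run of dots (plus trailing spaces or tabs) at the end of a line, which is an assumption about the intended behavior. Note that the hyphenation rule also joins genuine compounds that happen to break across lines (for example "well-" followed by "known").

```python
import re


def clean_lines(text: str) -> str:
    """Apply the line-level cleanup rules (a sketch, not the project's code)."""
    # Rejoin words hyphenated across a line break: "implemen-\ntation" -> "implementation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop footnote markers of the form [number], e.g. "[12]".
    text = re.sub(r"\[\d+\]", "", text)
    # Remove all asterisks.
    text = text.replace("*", "")
    # Strip runs of dots (and trailing spaces or tabs) at the end of each line.
    text = re.sub(r"\.+[ \t]*$", "", text, flags=re.MULTILINE)
    return text


if __name__ == "__main__":
    sample = "The implemen-\ntation works[1] *well*...\nNext line."
    print(clean_lines(sample))
```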
Additional preprocessing, such as normalizing the text, makes semantic parsing more effective and easier to implement.