Document Parsing - Add more text preprocessing.

sum244 / APICAD-artifact

MIT License


Closed jconn0 closed 5 months ago

jconn0 commented 6 months ago

More text preprocessing, such as normalizing the text, would make the semantic parsing more effective and easier to implement.

jconn0 commented 6 months ago

Text preprocessing makes the input text given to the model more consistent.

jconn0 commented 6 months ago

Some methods include:

Normalization - converting text to lowercase, standardizing expressions, etc. Helps the model treat words the same regardless of case.

Lemmatization - converting words to their base form. Helps the model recognize similarities between different forms of the same word.

Text cleanup - removing non-words, URLs, digits, etc. Helps the model by removing elements that are not relevant to understanding the text.
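
A minimal sketch of the normalization and cleanup steps listed above, using only the Python standard library. The function name `normalize` and the regular expressions are illustrative assumptions, not code from this repository. Lemmatization is left out because it needs a third-party library such as NLTK or spaCy.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^a-z\s]")   # anything that is not a lowercase letter or whitespace
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Lowercase the text, then strip URLs, digits, and other non-word characters."""
    text = text.lower()                  # normalization: case folding
    text = URL_RE.sub(" ", text)         # cleanup: remove URLs before punctuation is stripped
    text = NON_WORD_RE.sub(" ", text)    # cleanup: remove digits and punctuation
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse the gaps left behind


print(normalize("See https://example.com for Details, version 2!"))
# see for details version
```

URLs are removed before the non-word pass, otherwise the punctuation inside them would be stripped first and the leftover fragments would survive as ordinary words.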

jconn0 commented 5 months ago

Added: removal of hyphenation at line breaks, and removal of asterisks, dots at the end of a line, and footnote markers in the format [number].
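
For reference, a rough sketch of what these cleanup rules could look like with Python's `re` module. This is not the code that was added to the repository; the function name `clean_extracted_text` and the exact patterns are assumptions made for illustration.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Apply the four cleanup rules to raw document text (illustrative sketch)."""
    # Rejoin words split by a hyphen at a line break: "func-\ntion" -> "function".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    # Remove footnote markers such as [1] or [12].
    text = re.sub(r"\[\d+\]", "", text)
    # Remove asterisks.
    text = text.replace("*", "")
    # Remove dots (and surrounding spaces) at the end of each line.
    text = re.sub(r"[ \t]*\.+[ \t]*$", "", text, flags=re.MULTILINE)
    return text


sample = "The func-\ntion returns NULL on failure[1].\n* See also malloc."
print(clean_extracted_text(sample))
```

Footnote markers are removed before trailing dots so that a line ending in `[1].` is handled in one pass. One known limitation of this sketch: a genuinely hyphenated compound that happens to break across lines (for example "well-" / "known") would also be joined into a single word.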