leomrocha / gutenberg_explore

Repository with Gutenberg exploration code, notebooks, and a webpage with a dynamic data exploration report (paper/post)
MIT License

Improve tokenization #3

Open EvgeniiaVak opened 3 years ago

EvgeniiaVak commented 3 years ago

From https://github.com/leomrocha/mix_nlp/pull/5#issuecomment-767462136

The first step is always just going through the data (in this case, checking the least common words from the bottom up, since that is where most of the errors should be) and trying to find common patterns. Then you write a small piece of code that takes advantage of those patterns.
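A minimal sketch of that inspection step, using `collections.Counter` (the token list here is illustrative, not from the actual corpus):

```python
from collections import Counter

# Hypothetical token list; in practice this would come from the Gutenberg corpus.
tokens = ["the", "the", "word", "this-word", "my", "word}", "championship"]

counts = Counter(tokens)
# most_common() sorts by descending frequency; the tail of that list holds
# the rarest tokens, which is where tokenization errors tend to accumulate.
rarest = counts.most_common()[::-1][:5]
for word, freq in rarest:
    print(word, freq)
```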

For example, what if you find something like this-word and ultimate-championship? You could decide that these are either correct, or that they should be split at the - hyphen.
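One way to encode that decision is a splitter with a small whitelist of tokens that are legitimately hyphenated (the whitelist entry below is a made-up example, not a rule from the corpus):

```python
# Hypothetical whitelist of hyphenated words we decide to keep intact.
KEEP_HYPHENATED = {"self-service"}

def split_hyphens(token):
    """Split a token at hyphens unless it is whitelisted."""
    if "-" in token and token not in KEEP_HYPHENATED:
        return token.split("-")
    return [token]

print(split_hyphens("ultimate-championship"))  # ['ultimate', 'championship']
print(split_hyphens("self-service"))           # ['self-service']
```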

There might be other characters that you don't want in the words. For example, if you see something like my word}, you would want to clean up the } character.

Be careful when cleaning the data not to introduce noise into the entries that are already correct. One way to limit the risk is to run your cleaner only on the least frequent data, for example on words that have 2 or fewer occurrences, or only a single instance.
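That frequency guard could look like this (the strip-based cleanup and the cutoff of 2 are illustrative choices):

```python
from collections import Counter

def clean_rare(tokens, max_freq=2):
    """Strip stray brackets, but only from tokens seen max_freq times or
    fewer; frequent tokens are assumed correct and left untouched."""
    counts = Counter(tokens)
    cleaned = []
    for tok in tokens:
        if counts[tok] <= max_freq:
            tok = tok.strip("}{)([]")
        cleaned.append(tok)
    return cleaned

print(clean_rare(["word}", "the", "the", "the"]))  # ['word', 'the', 'the', 'the']
```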

As for how to do this, there are first the str.split and str.replace methods.
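For instance, applied to the examples above (token values are illustrative):

```python
# Built-in string methods cover the simple cases.
token = "my word}".replace("}", "")  # drop an unwanted character
parts = "this-word".split("-")       # split at the hyphen

print(token)  # 'my word'
print(parts)  # ['this', 'word']
```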

For more complex cases I would recommend the re regular-expression module: https://docs.python.org/3/library/re.html
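A sketch of how `re` can collapse several cleanup rules into one pass (the pattern is an illustration, not a tuned rule for this corpus):

```python
import re

def normalize(token):
    """Strip stray brackets, then split at hyphens if any remain."""
    token = re.sub(r"[{}\[\]()]", "", token)
    return token.split("-") if "-" in token else [token]

print(normalize("my{word}"))   # ['myword']
print(normalize("this-word"))  # ['this', 'word']
```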