Open kevindeyne opened 4 years ago
I actually want to make sure we don't touch any data, so I don't want to search for specific values.
But I do think adding different languages is a good idea.
Instead, I think I should look at what a column is allowed to contain (based on its collation and encoding) and then generate values from any language that fits within that. Doing it this way would also keep people from pushing their own biases onto the data.
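A rough sketch of that idea, assuming the column's character set can be mapped to a Python codec name (database names such as `utf8mb4` would need translating first). The `SAMPLES` pool and the `encodable_samples` helper are made up for illustration and are not part of the project:

```python
# Hypothetical sample pool, one short text per language.
SAMPLES = {
    "english": "hello world",
    "french": "déjà vu",
    "japanese": "こんにちは",
    "russian": "привет",
}


def encodable_samples(column_encoding):
    """Return the samples that the column's character encoding can store."""
    usable = {}
    for language, text in SAMPLES.items():
        try:
            text.encode(column_encoding)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the sample, skip it
        usable[language] = text
    return usable
```

With this, a `latin-1` column would only ever receive the English and French samples, while a `utf-8` column could receive any of them. No existing row is read or changed; only the column metadata drives the choice.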
Maybe use Apache Tika's LanguageDetector: https://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
Example: if we notice that a column contains some Japanese characters, generate values that use Japanese characters as well.
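To make the example concrete, here is a minimal sketch that guesses the dominant script of a column's existing values from Unicode character names. It is a standard-library stand-in, not the Tika detector, and it identifies scripts (HIRAGANA, CJK, LATIN) rather than languages. The `dominant_script` name is hypothetical:

```python
import unicodedata
from collections import Counter


def dominant_script(values):
    """Guess the most common script among the letters in the given values."""
    counts = Counter()
    for value in values:
        for ch in value:
            if not ch.isalpha():
                continue  # ignore digits, punctuation and whitespace
            # Unicode names start with the script, e.g. "HIRAGANA LETTER KO".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None
```

The values are only read to classify them, never modified. A result such as `HIRAGANA`, `KATAKANA` or `CJK` would be the signal to pick Japanese-looking replacement values; note that `CJK` alone cannot distinguish Japanese from Chinese, which is where a real language detector would help.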