Word Count Repository - Non English Languages

Handling non-English languages is a known problem. The word count logic depends heavily on regular expressions, which do not work for all languages. There are several cases to consider:

a) Markdown: this might be solved easily by converting the Markdown to HTML, then using an innerHtml-style function to extract all the text (in any language). The extracted text would then need to be run through the UW tokenizer to split it into individual words.
b) USFM (aligned): the text is already split into individual words.
c) USFM (unaligned): the text is not split, but is readily extractable; it would need to be run through the tokenizer to get the individual words.
d) UTN: with some pre-processing, this can be handled as Markdown.
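To make case (a) concrete, here is a rough Python sketch of the "extract text from HTML, then tokenize" idea. It is not the repository's code: `html_to_text`, `tokenize`, and `count_words` are hypothetical names, and `tokenize` is only a stand-in for the UW tokenizer, whose actual rules are not described here. The sketch assumes the Markdown has already been converted to HTML by some other tool, and it groups characters by Unicode category rather than relying on an ASCII-oriented regular expression.

```python
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Hypothetical helper: return the text content of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks apart, at the cost of
    # splitting a word that is interrupted by inline markup.
    return " ".join(parser.chunks)


def tokenize(text: str) -> list[str]:
    """Stand-in for the UW tokenizer (assumption, not its real behavior).

    A word is a run of characters whose Unicode general category is a
    letter (L*), a combining mark (M*), or a number (N*). Everything else
    (spaces, punctuation, symbols) ends the current word.
    """
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M", "N"):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def count_words(html: str) -> int:
    """Count words in HTML produced from Markdown."""
    return len(tokenize(html_to_text(html)))
```

Including combining marks in the word class matters for scripts such as Devanagari, where vowel signs would otherwise break a word apart. This approach still depends on visible separators between words, so scripts written without spaces (for example Chinese or Thai) would need a real segmenter.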

@jag3773 @klappy - Please comment as needed.

unfoldingWord / uw-word-count

Word Count Repository - Non English Languages #4