sanitizepunctuation() and removepunctuation() both fail to handle a few things correctly:
[ ] They use very different lists of quote/punctuation characters. We could fix that by defining variables that both functions reuse (the lists drafted so far are only a starting point and are NOT correct yet).
[ ] Neither function handles most French-style punctuation, i.e. a (protected) space before the punctuation character. This is fairly easy for the usual protected (no-break) space but harder for plain spaces, because guillemets are used both French-style (« word ») and German-style (»word«).
[ ] sanitizepunctuation()'s quote sanitization is not used. Do we need it? (It does look like it would be helpful as a step before running sentencesegmenter() or tokenizer().)
[ ] Do we need regular expressions at all?
[ ] Clean up sanitizepunctuation()'s quote check so that it also matches quotes in the middle of a word, but not apostrophes in the middle of a word?
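
A rough sketch of how some of the items above could fit together, written in Python purely for illustration. Everything in it is an assumption: the constant names, the character lists (deliberately incomplete), and the two helper functions are hypothetical and are not the existing sanitizepunctuation()/removepunctuation() code. It shows one shared set of character lists, a regex that drops protected spaces around punctuation, and a regex-free removal loop that keeps apostrophes inside words. Plain spaces are left untouched on purpose, since « word » and »word« cannot be told apart without knowing the language.

```python
import re

# Hypothetical shared character lists (illustrative, NOT complete or verified).
APOSTROPHES = "'\u2019"                      # ' and right single quote
QUOTES = "\"'\u00ab\u00bb\u201c\u201d\u201e\u2018\u2019\u201a"
SENTENCE_PUNCT = ".,;:!?\u2026"
PUNCTUATION = QUOTES + SENTENCE_PUNCT
PROTECTED_SPACES = "\u00a0\u202f"            # no-break and narrow no-break space

# One or more protected spaces directly before any punctuation character.
_SPACE_BEFORE = re.compile(
    "[" + PROTECTED_SPACES + "]+(?=[" + re.escape(PUNCTUATION) + "])"
)
# One or more protected spaces directly after a left-pointing guillemet.
_SPACE_AFTER_GUILLEMET = re.compile("(?<=\u00ab)[" + PROTECTED_SPACES + "]+")


def tighten_french_spacing(text):
    """Drop protected spaces before punctuation and after a left guillemet.

    Plain spaces are never touched, so German-style guillemets keep the
    ordinary word separators around them.
    """
    text = _SPACE_BEFORE.sub("", text)
    return _SPACE_AFTER_GUILLEMET.sub("", text)


def remove_punctuation_sketch(text):
    """Remove every character in PUNCTUATION without using a regex.

    An apostrophe with an alphanumeric character on both sides is kept,
    because it is assumed to belong to the word (don't, l'eau). Any other
    quote character is removed, even in the middle of a word.
    """
    out = []
    last = len(text) - 1
    for i, ch in enumerate(text):
        if ch not in PUNCTUATION:
            out.append(ch)
        elif (
            ch in APOSTROPHES
            and 0 < i < last
            and text[i - 1].isalnum()
            and text[i + 1].isalnum()
        ):
            out.append(ch)  # word-internal apostrophe: keep it
    return "".join(out)
```

Because both helpers read from the same constants, adding a character to one list changes both behaviours at once, which is the point of the first item. Whether the regex or the plain loop is preferable is still open; the loop shows that removal at least does not strictly need regular expressions.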