Open Synaps3 opened 7 years ago
Looks like numbers and punctuation are stripped as part of doc_clean_process, so I think this is mostly gone by that point. Closing because it's handled in a later step.
Reopening because I see a lot of "â\u0080\u0094 " getting through the doc clean process
TODO: remove all "\u00XX" fragments (\u followed by four hex digits starting with 00) from the raw text outputs
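A rough sketch of one way to handle the TODO, in case it helps. The "â\u0080\u0094" sequence looks like the three UTF-8 bytes of a dash (U+2014) that were read as Latin-1, so this tries to decode such runs back into the intended character and then drops any leftover U+0080..U+009F control characters. `strip_escape_fragments` and the pattern names are made up for illustration and are not part of doc_clean_process; it also assumes the fragments may show up either as real characters or as literal six-character `\u00XX` text.

```python
import re

# Literal six-character escapes such as \u0080 that survived as plain text.
LITERAL_ESCAPE = re.compile(r"\\u00[0-9a-fA-F]{2}")

# A UTF-8 lead byte misread as Latin-1, followed by the number of
# continuation bytes (U+0080..U+00BF) that lead byte calls for.
MOJIBAKE_RUN = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)

# C1 control characters, which should not appear in clean text.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def _redecode(match):
    """Reinterpret a Latin-1-decoded run as UTF-8; keep it if that fails."""
    run = match.group(0)
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        return run


def strip_escape_fragments(text):
    """Hypothetical helper: repair or remove "\\u00XX" debris in raw text."""
    # 1. Turn literal \u00XX text into the character it names.
    text = LITERAL_ESCAPE.sub(lambda m: chr(int(m.group(0)[2:], 16)), text)
    # 2. Decode misread UTF-8 runs back into the intended character.
    text = MOJIBAKE_RUN.sub(_redecode, text)
    # 3. Drop any control characters that could not be repaired.
    return C1_CONTROLS.sub("", text)
```

This would have to run before the punctuation stripping, since the repaired dash is ordinary punctuation that the later step can then remove. It can misfire on legitimate text where an accented Latin-1 letter is directly followed by characters in the U+0080..U+00BF range, so it is worth checking against a sample of real outputs first.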