ryscott5 / eparTextTools

BSD 3-Clause "New" or "Revised" License

OCR_DOCS frequently gets \u00** junk data which could be easily removed from the raws #6

Open Synaps3 opened 7 years ago

Synaps3 commented 7 years ago

TODO: remove all "\u00**" fragments (a "\u" followed by four hex digits starting with "00") from the raw text outputs
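A minimal sketch of what that removal could look like, written in Python for illustration only (it is not code from this repository). It assumes the junk may show up in two forms: as literal escape text (a backslash, `u00`, and two hex digits) or as actual control characters in the U+0000 to U+009F range. The function name `strip_u00_fragments` is hypothetical.

```python
import re

# Assumption: junk appears as literal six-character escape text, e.g. \u0080.
LITERAL_ESCAPE = re.compile(r"\\u00[0-9a-fA-F]{2}")

# Assumption: junk may also appear as real C0/C1 control characters.
# Tab (\x09), newline (\x0a) and carriage return (\x0d) are kept on purpose.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_u00_fragments(text: str) -> str:
    """Remove literal \\u00XX escapes and stray control characters."""
    text = LITERAL_ESCAPE.sub("", text)
    return CONTROL_CHARS.sub("", text)
```

Note that the literal pattern also matches escapes for printable characters such as `\u0041`, so it should only be applied to text where every such escape is known to be debris.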

Synaps3 commented 7 years ago

Looks like numbers and punctuation are stripped as part of doc_clean_process, so I think this is mostly gone by that point. Closing because a later step already handles it.

Synaps3 commented 7 years ago

Reopening because I see a lot of "â\u0080\u0094 " getting through doc_clean_process.
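For context, the sequence "â\u0080\u0094" is what the UTF-8 bytes `E2 80 94` (U+2014, a long dash) look like when they are decoded as Latin-1, which would explain why stripping ASCII punctuation does not catch it. The sketch below, in Python and purely illustrative, re-decodes such runs instead of deleting them. It assumes the text was mis-decoded as Latin-1 rather than Windows-1252, which matches the `\u0080` and `\u0094` values shown above. The name `repair_mojibake` is hypothetical.

```python
import re

# Characters that look like a UTF-8 lead byte followed by the right number
# of continuation bytes, after the bytes were wrongly decoded as Latin-1.
MOJIBAKE_RUN = re.compile(
    r"[\xc2-\xdf][\x80-\xbf]"        # 2-byte sequences
    r"|[\xe0-\xef][\x80-\xbf]{2}"    # 3-byte sequences
    r"|[\xf0-\xf4][\x80-\xbf]{3}"    # 4-byte sequences
)


def _redecode(match: re.Match) -> str:
    chunk = match.group(0)
    try:
        # Turn the characters back into the original bytes, then decode properly.
        return chunk.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 after all: leave the text untouched.
        return chunk


def repair_mojibake(text: str) -> str:
    """Replace Latin-1-decoded UTF-8 runs with the intended characters."""
    return MOJIBAKE_RUN.sub(_redecode, text)
```

Repairing before the cleaning step means the dash becomes a single punctuation character that an ordinary punctuation filter can then handle. If the goal is only to drop the debris, the same pattern could be substituted with an empty string or a space instead.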