ryscott5 / eparTextTools

Erroneous terms in the corpus #19

Closed adamlhayes closed 7 years ago

adamlhayes commented 7 years ago

Reading PDFs into a corpus can produce many terms with "â" attached to the beginning or end of a word (e.g., "theâ"), as well as erroneous terms such as "â\u0080\u0093".

Consider including an additional function within doc_clean_process() to get rid of select special characters, something like the following: removeSpecialChars <- content_transformer(function(x) gsub("[^a-zA-Z&-]"," ",x))
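
A minimal sketch of how the suggested transformer might behave, assuming the regex proposed above is used unchanged. The helper name remove_special_chars and the sample strings are hypothetical, and the extra whitespace collapse is an addition not in the original suggestion. How this would be wired into doc_clean_process() is not shown, because that function's internals are not described here.

```r
# Hypothetical helper mirroring the regex proposed above: every character
# that is not an ASCII letter, "&", or "-" is replaced with a space.
remove_special_chars <- function(x) {
  gsub("[^a-zA-Z&-]", " ", x)
}

# Sample input (made up for illustration), written with \u escapes so the
# script itself stays ASCII.
garbled <- c("the\u00e2 report", "rice \u00e2\u0080\u0093 maize")
cleaned <- remove_special_chars(garbled)

# Assumption, not part of the original suggestion: collapse the runs of
# spaces left behind by the substitution and trim the ends.
cleaned <- trimws(gsub(" +", " ", cleaned))
print(cleaned)

# If the tm package is installed, the same function can be wrapped as a
# content transformer and applied to a corpus with tm_map().
if (requireNamespace("tm", quietly = TRUE)) {
  removeSpecialChars <- tm::content_transformer(remove_special_chars)
  corpus <- tm::VCorpus(tm::VectorSource(garbled))
  corpus <- tm::tm_map(corpus, removeSpecialChars)
}
```

Note that this regex also removes digits and all punctuation other than "&" and "-", so it is only suitable where those characters are not needed downstream.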

Synaps3 commented 7 years ago

We should merge this bug with bug #6 as they are both about the same thing.

Graham

Synaps3 commented 7 years ago

Closing, duplicate of #6