adamlhayes closed this issue 7 years ago
We should merge this bug with bug #6 as they are both about the same thing.
Graham
Closing, duplicate of #6
Reading PDFs into a corpus can produce many terms with a stray "â" at the beginning or end of a word (e.g., "theâ"), as well as erroneous terms such as "â\u0080\u0093" (the UTF-8 bytes of an en dash, U+2013, mis-decoded as Latin-1).
Consider adding a step to doc_clean_process() that strips selected special characters, something like: removeSpecialChars <- content_transformer(function(x) gsub("[^a-zA-Z&-]", " ", x))
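A minimal sketch of how the suggested transformer could be wired into a tm pipeline. This assumes the corpus is a tm VCorpus (the corpus name and the sample text below are illustrative, not from doc_clean_process() itself); the character class keeps letters, "&", and "-", replacing everything else, including the mojibake bytes, with a space:

```r
# Sketch, assuming the tm package and a VCorpus built from the PDFs.
library(tm)

# Replace any character outside [a-zA-Z&-] with a space;
# this removes mojibake such as "â" and "â\u0080\u0093".
removeSpecialChars <- content_transformer(function(x) {
  gsub("[^a-zA-Z&-]", " ", x)
})

# Hypothetical example corpus showing the kind of garbled input described above.
corpus <- VCorpus(VectorSource(c("theâ report â\u0080\u0093 final")))
corpus <- tm_map(corpus, removeSpecialChars)
corpus <- tm_map(corpus, stripWhitespace)  # collapse the runs of spaces left behind
```

Note that this drops all digits and punctuation too, so it is a blunt instrument; a narrower fix would be to repair the encoding at read time (e.g., forcing UTF-8 when the PDFs are ingested) rather than deleting the damaged characters afterwards.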