Question: Strip non-UTF characters in data?

colinxfleming commented 6 years ago

Hey folks! I was loading this into Postgres to poke at and ran into some errors: my COPY statement choked on lines with non-UTF-8 characters in them, such as line 31119 in homicides-data.csv. I was able to work around it no problem, but figured I'd pay it forward and check --

I wanted to ask whether it would be helpful to convert these characters to something UTF-8 friendly. Feel free to close this issue if you all would rather not; if that would be helpful, please let me know and I'll spin up a PR for it.

Thanks again for making this data public!

colinxfleming commented 6 years ago

I should have included an example row:

Lou-000444,20100809,ODONNELL,APRIL,White,21,Female,Louisville,KY,38.1841416,-85.605567,Closed by arrest
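For anyone else who hits the same COPY error, here is a minimal sketch (not something from the repo) for finding the affected lines before loading. The function name `find_non_utf8_lines` is made up for illustration; it simply reads the file as raw bytes, tries to decode each line as UTF-8, and reports the lines that fail.

```python
def find_non_utf8_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    bad = []
    # Open in binary mode so Python does not try to decode anything for us.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad
```

Calling `find_non_utf8_lines("homicides-data.csv")` should include line 31119 in its result if the file still contains the bytes described above.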

y2kbowen commented 6 years ago

Fixed in my pull request #6. I couldn't get pandas to read the data, so I found the invalid characters (there were several) and deleted them. Pandas reads the file fine now.

y2kbowen commented 4 years ago

Pandas reports the line where it has trouble reading the data. I used Notepad++, turned on the option that shows special characters, and changed them. I remember it being a problem with only a few rows in the data. I checked the changes into my clone of the data here: https://github.com/y2kbowen/data-homicides. You can see the lines that had problems and the changes I made here: https://github.com/y2kbowen/data-homicides/commit/99294e0db933fc1b2914549420654d2827df9ccd
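If editing by hand is not practical, a rough alternative is to rewrite the file as UTF-8 in a script. The sketch below is only that: `convert_to_utf8` is a made-up name, and it assumes the stray bytes are Windows-1252 (cp1252), which has not been confirmed for this dataset. Lines that already decode as UTF-8 are copied unchanged.

```python
def convert_to_utf8(src_path, dst_path, fallback="cp1252"):
    """Copy src_path to dst_path as UTF-8; return the numbers of repaired lines.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    repaired = []
    # newline="" keeps the original line endings untouched on write.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for lineno, raw in enumerate(src, start=1):
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # ASSUMPTION: the invalid bytes are Windows-1252 text.
                # Bytes that cp1252 does not define become U+FFFD.
                text = raw.decode(fallback, errors="replace")
                repaired.append(lineno)
            dst.write(text)
    return repaired
```

The returned line numbers are worth checking by eye, since a wrong guess at the fallback encoding produces valid UTF-8 with the wrong characters. pandas can also be told the encoding directly through the `encoding` argument of `read_csv`, under the same assumption about what the source encoding is.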

I hope this helps

KB

On Mon, Apr 27, 2020 at 6:24 PM msmith2024 notifications@github.com wrote:

I am unable to read it in pandas. How did you correct the problem?


washingtonpost / data-homicides
