washingtonpost / data-homicides

The Washington Post collected data on more than 52,000 criminal homicides over the past decade in 50 of the largest American cities.

Question: Strip non-UTF characters in data? #4

Open colinxfleming opened 6 years ago

colinxfleming commented 6 years ago

Hey folks! I was loading this into Postgres to poke at it and ran into some errors: my COPY statement choked on lines containing bytes that are not valid UTF-8, such as line 31119 in homicides-data.csv. I was able to work around it with no problem, but figured I'd pay it forward and check.

I wanted to ask whether it would be helpful to convert these characters to something UTF-8 friendly. Feel free to close this issue if you all would rather not; if it would be helpful, please let me know and I'll spin up a PR for it.

Thanks again for making this data public!
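For anyone who wants to locate the affected rows before loading, here is a minimal sketch that reports which lines fail to decode as UTF-8. The function name `find_invalid_utf8_lines`, the sample bytes, and the stray `0x92` byte are all hypothetical and chosen only for illustration; they are not taken from the actual file.

```python
def find_invalid_utf8_lines(raw: bytes) -> list[int]:
    """Return the 1-based numbers of lines whose bytes are not valid UTF-8."""
    bad = []
    for lineno, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(lineno)
    return bad


# Hypothetical sample: the third line contains a lone 0x92 byte,
# which is not a valid UTF-8 sequence on its own.
sample = b"id,last_name\nLou-1,SMITH\nLou-2,O\x92DONNELL\n"
print(find_invalid_utf8_lines(sample))
```

To run this against the real file, the bytes could be read with `open("homicides-data.csv", "rb").read()`. Note that the sketch splits on raw line breaks, so a quoted CSV field containing a newline would count as two lines.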

colinxfleming commented 6 years ago

Should include an example row:

Lou-000444,20100809,ODONNELL,APRIL,White,21,Female,Louisville,KY,38.1841416,-85.605567,Closed by arrest

y2kbowen commented 6 years ago

Fixed in my pull request #6. I couldn't get pandas to read the data, so I found the several invalid characters and deleted them. Pandas reads the file fine now.

y2kbowen commented 4 years ago

Pandas reports the line where it has a problem reading the data. I used Notepad++, turned on the option that shows special characters, and changed them. I remember it being a problem with only a few rows in the data. I checked the changes in to my clone of the data here: https://github.com/y2kbowen/data-homicides

You can see the lines that had problems and the changes I made here: https://github.com/y2kbowen/data-homicides/commit/99294e0db933fc1b2914549420654d2827df9ccd

I hope this helps

KB
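The manual clean-up described above could also be scripted. The sketch below is one possible approach, not the method used in the linked commit: it decodes each line as UTF-8 and falls back to Windows-1252 for lines that fail. The fallback encoding is an assumption, since the thread does not say what encoding the stray bytes came from, and the helper name `to_utf8` and the sample data are hypothetical.

```python
import csv
import io


def to_utf8(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode line by line as UTF-8, using `fallback` for lines that fail."""
    out = []
    for line in raw.splitlines(keepends=True):
        try:
            out.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumption: the bad bytes are Windows-1252. Bytes that are
            # undefined there become U+FFFD instead of raising an error.
            out.append(line.decode(fallback, errors="replace"))
    return "".join(out)


# Hypothetical sample with one byte (0x92) that is invalid as UTF-8.
sample = b"id,last_name\nLou-1,SMITH\nLou-2,O\x92DONNELL\n"
text = to_utf8(sample)
rows = list(csv.reader(io.StringIO(text)))
print(rows[2])
```

The cleaned text can be written back out with `open(path, "w", encoding="utf-8")`. In pandas itself, `read_csv` accepts an `encoding` argument (for example `encoding="cp1252"`), and from version 1.3 an `encoding_errors` argument, which may be enough to load the file without editing it.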

On Mon, Apr 27, 2020 at 6:24 PM msmith2024 notifications@github.com wrote:

I am unable to read in pandas, how did you correct the problem?