washingtonpost / data-police-shootings

The Washington Post is compiling a database of every fatal shooting in the United States by a police officer in the line of duty since 2015.

Duplicate entries #20

Closed: ck1029 closed this issue 4 years ago

ck1029 commented 4 years ago

I just downloaded and browsed this data out of curiosity and I noticed some duplicate entries. They may not affect the overall count by much, but I think WaPo should give these rows some further scrutiny considering that these numbers are cited by many other publications as an accurate accounting of police shootings. Not to mention that each row represents a human life, so it's not exactly some small matter. Do it for journalistic integrity if nothing else.

These four are almost certainly duplicates:

5603 Terry Hasty 2/25/2020 shot gun 56 M Dalzell SC
5572 Terry Hasty 2/25/2020 shot gun 56 M Sumter County SC

5548 Timothy Leroy Harrington 2/14/2020 shot gun 58 M W Polkton NC
5537 Timothy Leroy Harrington 2/14/2020 shot gun 58 M W Polkton NC

5889 David Tylek Atkinson 5/13/2020 shot gun 24 M B Raleigh NC
5835 David Tylek Atkinson 5/13/2020 shot gun 24 M B Raleigh NC

5191 Benjamin Diaz 11/1/2019 shot box cutter 22 M H Alamogordo NM
5150 Benjamin Diaz 11/1/2019 shot sharp object 22 M H Alamogordo NM

These are questionable:

4237 Roderick McDaniel 11/20/2018 shot vehicle 33 M B Magnolia AR
4195 Roderick McDaniel 11/19/2018 shot gun 33 M B Magnolia AR

5515 Miguel Mercado Segura 1/21/2020 shot gun 31 M H Fountain Valley CA
5389 Miguel Mercado Segura 1/20/2020 shot gun 31 M H Fountain Valley CA

5135 Clayton Andrews 10/26/2019 shot gun 40 M W Creek County OK
5128 Clayton Andrews 10/25/2019 shot gun 40 M W Kansas OK

dar2b commented 4 years ago

Duplicates and used through Tableau?

cswroe commented 4 years ago

Records 5691 and 5721 are another duplicated pair. I have also reached out for clarification but have yet to hear back.

JohnDovey commented 4 years ago

I’ve copied the csv file to a spreadsheet on Google Sheets and removed the duplicates https://docs.google.com/spreadsheets/d/1FY8f0zEnRGPhlYdBt8OuEbkWQv1dbQ3CFj5QpyC7PEY/edit

JohnDovey commented 4 years ago

Also, created a few graphs using the Google Sheet which you can see at https://datastudio.google.com/s/rkvO9wwssXI

ck1029 commented 4 years ago

I appreciate your efforts, JohnDovey. But what’s really needed is for the Washington Post to correct the source data, rather than netizens such as ourselves publishing corrected versions that no one else will ever see or use. WaPo’s source data is cited by many other publications and think tanks, and for some reason, is treated as gospel.

I downloaded this data last week because I have seen it referenced so often recently due to the current upheaval in my country with regard to protests about police brutality and systemic racism (those two weighty topics are very deserving of debate, but beyond the scope of this rant haha). I just wanted to examine the source data for myself because I am both a data analyst and a skeptic by nature. As a general rule, if source data is made available for public review, then I will review it myself rather than rely on a third party’s interpretation. I do applaud the Washington Post for doing that at least.

So last week, while I was on my lunch break and devouring a bowl of leftover spaghetti that I had made the previous evening (it was delicious, BTW), I pasted the data into an Excel spreadsheet and gave it a quick once-over. I identified those seven duplicates within twenty minutes.

Twenty minutes. Meanwhile, WaPo has been compiling and reporting this data since 2015.

A few days afterward, “cswroe” identified another duplicate entry (5691 & 5721), which I must have overlooked during my initial review because I was too caught up in enjoying the absolute deliciousness of my homemade spaghetti (which, should I have failed to mention, was indisputably superb).

The duplicate noted by cswroe was from 2020, and four of the seven duplicates I mentioned were from 2020. That means that the 2020 numbers reported by the Washington Post are inflated by at least five. A small number to be sure, but either accuracy matters or it doesn’t. And by now, I believe I know where the Washington Post stands. I emailed my findings to policeshootingsfeedback@washpost.com on 06/11/2020, and I assume cswroe did the same. Haven’t heard back, and nothing has been corrected.

This is kind of hilarious because so many reporters blame Donald Trump for America’s distrust of the media, without realizing that they themselves are the cause. And no, I’m not being political…I don’t even like Trump!

I have not conducted a deeper dive into the data at this time (my nerdiness and uncompensated work efforts extend only so far), but it’s possible that there are more duplicates. For example, in my quick evaluation, I checked only names. But what if one row lists a name for “William Smith”, and another row is “Bill Smith”, and they are the same person? And what about all the “TK TK” entries? Those names are placeholders, yet they are never updated. There could be some TK TK duplicates as well.
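For anyone who wants to repeat this kind of name check outside of Excel, here is a minimal sketch in Python using only the standard library. It assumes the CSV has `id`, `name`, `date` (ISO format), and `state` columns, which is an assumption about the file layout, and the function name `find_duplicate_candidates` is just an illustrative helper. The two `TK TK` sample rows are invented to show placeholders being skipped. It flags same-name records in the same state whose dates are at most one day apart, so it would not catch a "William Smith" versus "Bill Smith" pair.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from itertools import combinations

# Small inline sample. The first four rows mirror records quoted earlier in
# this thread; the two "TK TK" rows are invented placeholders for illustration.
SAMPLE = """id,name,date,city,state
5603,Terry Hasty,2020-02-25,Dalzell,SC
5572,Terry Hasty,2020-02-25,Sumter County,SC
4237,Roderick McDaniel,2018-11-20,Magnolia,AR
4195,Roderick McDaniel,2018-11-19,Magnolia,AR
9001,TK TK,2020-01-01,Springfield,XX
9002,TK TK,2020-01-01,Shelbyville,XX
"""


def find_duplicate_candidates(rows, max_days_apart=1):
    """Return (id, id) pairs that share a name and state and have close dates."""
    by_name = defaultdict(list)
    for row in rows:
        # Normalize case and whitespace so trivial differences do not hide a match.
        name = " ".join(row["name"].lower().split())
        # Skip blank names and the "TK TK" placeholder; they cannot be matched by name.
        if not name or name == "tk tk":
            continue
        by_name[name].append(row)

    pairs = []
    for group in by_name.values():
        for a, b in combinations(group, 2):
            day_a = datetime.strptime(a["date"], "%Y-%m-%d").date()
            day_b = datetime.strptime(b["date"], "%Y-%m-%d").date()
            close_in_time = abs((day_a - day_b).days) <= max_days_apart
            if a["state"] == b["state"] and close_in_time:
                pairs.append((a["id"], b["id"]))
    return pairs


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
candidates = find_duplicate_candidates(rows)
for first_id, second_id in candidates:
    print(f"possible duplicate: {first_id} and {second_id}")
```

Anything this flags is only a candidate. Each pair still needs a manual check against the source reporting before a row is removed.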

The Washington Post is one of the biggest publications in the United States, and it also happens to be wholly owned by the richest man on the planet. This newspaper does not lack for resources, and there is no excuse for such shoddy reporting.

Okay…rant over. This post was WAY longer than anticipated.

JohnDovey commented 4 years ago

Thank you CK, appreciate hearing from you. Maybe one day you can let me taste some of that so delicious pasta. I’m in Panama, so if you ever decide to travel to this part of the world feel free to push my button and I’ll treat you to some of the food from this place.

I share some of your passions. Accurate and accessible data is one of those. I first thought I’d do a pull request and update the data then request a merge. I saw though that there are no closed issues and assumed, rightly it seems, that while the data is shared it’s not seen as a collaborative effort. I’ve been looking at the school shooting data (https://github.com/washingtonpost/data-school-shootings) and there they seem to have been a little more responsive about updating the data when errors etc. have been pointed out.

I’d love to do some normalization, to at least 3rd Normal Form, which would instantly highlight issues with the data, and I considered doing that but decided it was a waste of time.

Creating a copy of the data on Google Sheets was, I decided, a more worthwhile effort because it’s also publicly shareable, I could give you editor rights for example, and it serves as a data source for Google’s Data Studio. I removed the duplicates as pointed out above, and the graphs were instantly updated when the page was refreshed. It seems to me that this is a worthwhile effort. If you want to use the Data Studio to “explore” or create your own reports, you can do so directly from the one data source.

I’m not terribly happy with the graphs I’ve created, nor with all the data, so I’ll work on it a bit and then tout it around a bit. It might even get picked up by the media... they love it when other people do their work for them.

One of the issues I have in the dataset is the fact that some of the data is simply missing, such as race for example. There’s also some weird stuff in there which I’m sure made sense to whoever compiled the data, but there’s no explanation for people like us coming afterwards.

My daughter has just made a truly decadent chocolate pudding, so I’ll leave this here for now to indulge in that.

JohnDovey commented 4 years ago

Chart: shootings by race, excluding all records where no race was indicated.

cswroe commented 4 years ago

The duplicates are not really that big of an issue. They appear to be spread equally across the race demographics, so they do not seem to be nefarious. I am appreciative that at least someone is attempting to keep track and that the data is open and available to everyone. That availability and openness allow us to find duplicates, etc. From just a cursory look at the TK data, I am not seeing any duplicates.

There are other anomalies that seem to be larger issues with the data. That may just be because a person is reading an article and inputting the data manually, or perhaps because there is no uniform methodology for inputting the information. For instance, when filtering data for California, it shows 798 records. 436 of those have a threat level of "attack" and 510 show "not fleeing". Does that mean a majority of people were attacking by not fleeing? It makes zero sense. A HUGE factor that is absent from the data is the underlying crime that resulted in the use of force.

Again, I am appreciative of the data and can only see improvements as it becomes standardized across the country, and seeing folks build data dashboards is awesome. If anything, it shows that there are folks that want the answers for themselves, good or bad. If you really want an eye-opener, look at the Violent Crime Data in relation to population and the information in this dataset for police shootings.
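One way to look at the "attack" versus "not fleeing" question is a cross-tabulation, since the two counts come from separate columns and can overlap. Below is a rough Python sketch using only the standard library. The column names `threat_level` and `flee`, the value spellings, and the helper name `crosstab` are assumptions, and the sample rows are invented, so the counts here say nothing about the real California data.

```python
import csv
import io
from collections import Counter

# Invented sample rows; column names and values are assumed, not taken from the real file.
SAMPLE = """id,state,threat_level,flee
1,CA,attack,Not fleeing
2,CA,attack,Car
3,CA,other,Not fleeing
4,CA,attack,Not fleeing
5,TX,attack,Foot
"""


def crosstab(rows, state):
    """Count records for one state by (threat_level, flee) combination."""
    return Counter(
        (row["threat_level"], row["flee"])
        for row in rows
        if row["state"] == state
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
table = crosstab(rows, "CA")
for (threat, flee), count in sorted(table.items()):
    print(f"{threat:>8} | {flee:<12} | {count}")
```

Reading the combined cells shows how many records are both "attack" and "not fleeing", which the two separate filter totals cannot show on their own.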

ck1029 commented 4 years ago

Yeah, I don't think the errors are intentional or nefarious, but they could be easily corrected. There are also other issues with the data as you have pointed out.

The data could be made much more informative and reliable with just a little effort. Again, we are talking about the Washington Post here, not some underfunded local rag. The sloppiness of it all bugs me more than anything.

mikekeith52 commented 4 years ago

The identified duplicates do not materially change the data. I have worked for big companies with a lot of resources and have yet to find a perfect dataset or one without some mistakes or data that doesn't make sense. That's why you document any changes you make as you perform your own analyses and defend your own decisions and conclusions using solid methodology. These are standard data management issues; I have yet to see a company that does this perfectly. No need for hysteria.

ck1029 commented 4 years ago

I recognize that as a percentage of the whole, the errors are statistically meaningless. And while perfection in large datasets gathered from multiple sources is never expected, a minimal amount of data validation is. I don’t think that’s a hysterical suggestion. I do understand that the tone of my second post may have been a bit harsh…had a few beers in me at that point. :)

Imperfect data can still be utilized to gather meaningful insights, but in my own line of work, I would never find it acceptable to include in my summary analyses any record from the dataset which I know to be inaccurate. As previously mentioned, we’re talking about a journalistic entity which is supposed to strive for reporting accuracy, and each number reported here represents a human life. I doubt that the family members of those killed think of their loved ones as a rounding error.

Anyway, I just think that this is a worthwhile project that could be made better with minimal effort on WaPo’s part. I do give them credit for starting this project and sharing their data. I only criticize because I care lol.

I’ll step off my high horse now.

angelakdang commented 4 years ago

I attempted to remove duplicates while citing the articles that helped me decide which records to remove. I'm hoping the moderators of this repo will approve my pull request here. The data here is decidedly very public -- we just need a more active moderator.

Perhaps if we got some activity on the PR we could get them to update the data.

That being said, the somewhat "cleaner" data is available from my forked project.

I'm interested in recreating the claims that Sam Harris made in his latest podcast called Can We Pull Back From The Brink, which also references this paper: An Empirical Analysis of Racial Differences in Police Use of Force

His podcast and the paper indicate (paraphrasing) that black and Hispanic people are more than 50% more likely to experience some form of force in interactions with police. However, when it comes to violent, officer-involved shootings, there are no racial differences.

So, while systemic racism exists and persists in America and elsewhere, the specific issue of police shootings may not be race related.

jmuyskens commented 4 years ago

Thank you for finding these duplicates. This has been resolved by 4b61f9a25982d587deacd4f83272f43d701321b8 and b3ebc7cdd68fc33cc72d5d514e78a568b12ec37f. Note that we are not merging PRs because the CSV in this repo is published downstream of our internal database.