open-austin / project-ideas

:bulb: A place to collect ideas for Open Austin projects

Turn daily arrest data posted by Harris County into a real database #73

Closed safeandfree closed 5 years ago

safeandfree commented 8 years ago

The repo for this project idea is here

I think this is easy, but it is over my head. :)

There is a daily data file posted in Houston here http://www.jims.hctx.net/jimshome/jimsreports/jims1058.txt

This is 24 hours of arrests. The file is replaced every day. We are trying to show that Harris County's jail is full of people they didn't have to arrest.

We pulled it, and yesterday in Houston:

- A guy was arrested for a bad headlight, no other charge
- A guy was arrested for a stop sign violation, no other charge
- Eleven people were arrested for poss less than a gram, pg 1 and no other offense (keeping in mind, 1/3rd or more of these are likely to be bad field test victims)
- Two people were jailed for theft less than $50
- A woman jailed for a series of things that just look like “piling on” – unclean license plate (that’s a thing?), cardboard over a car window, no insurance
- Two guys were arrested for evading with no additional charge
- A guy was arrested for evading with a vehicle with no other charge

The list goes on.

And now for the help I need.

I want to automate a daily download of this dataset. I can create an automated task in Access that will grab a file from my desktop and load it to Access.

That's it. I just want to grab this file every day and not have to do it manually. I hope that's pretty simple. I would suggest including it in your data portal project, but it's Harris County data, so not local. It is really valuable for criminal justice research.
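For anyone picking this up, here is a minimal sketch of the daily grab described above. Only the URL comes from this post; the function names, the `arrests` folder, and the `YYYY-MM-DD.txt` naming are assumptions for illustration. Saving each day under its own dated name matters because the county replaces the file every day.

```python
import datetime
import pathlib
import urllib.request

# URL taken from the post above; everything else is an illustrative assumption.
REPORT_URL = "http://www.jims.hctx.net/jimshome/jimsreports/jims1058.txt"


def dated_path(directory, day):
    """Build a per-day path such as <directory>/YYYY-MM-DD.txt (hypothetical layout)."""
    return pathlib.Path(directory) / f"{day.isoformat()}.txt"


def save_snapshot(content, directory, day):
    """Write one day's report to its own file so earlier days are not overwritten."""
    path = dated_path(directory, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path


def download_report(url=REPORT_URL, timeout=30):
    """Fetch the raw bytes of the report as currently posted."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `save_snapshot(download_report(), "arrests", datetime.date.today())` once a day from a scheduler (cron, or Task Scheduler on Windows) would leave one file per day in a folder that an Access import task could then read.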

werdnanoslen commented 7 years ago

Hi all, what's the status on this project? Is it still active? If so, who could keep this issue updated so that others can decide whether to join?

mateoclarke commented 7 years ago

The script is still running and collecting data: https://github.com/open-austin/harris-county-bookings/blob/master/data/2017/2017-04-13.csv

I think it would make sense to label this "Launched" or "status: alpha"

A good next step would be to publish a subset of the data to data.world and think of some research questions that people could dig into.

werdnanoslen commented 7 years ago

Gonna add status: alpha for now until we decide on a labeling scheme in https://github.com/open-austin/iced-coffee/issues/242

laconc commented 6 years ago

Hey all, @mscarey introduced me to this project and I'm interested in helping out. I submitted a PR last night that adds a unique identifier for each person and, in addition to saving to the GitHub repo, pushes the data to data.world (as datasets aggregated by year).

PR: https://github.com/open-austin/harris-county-bookings/pull/17

I can also go ahead and create the yearly datasets for all the data that's already been collected.

Yesterday's data in data.world:

Screenshot (2017-12-03): https://user-images.githubusercontent.com/20423536/33528034-7b42e1a6-d820-11e7-8bc7-438d10d2813b.png

There are a few issues open in the repo but there doesn't seem to be a lot of activity there so I'll repost the highlights here for discussion:

On that last point, it seems like we can just continue as we are for a while. Even though the data can live on data.world, I love the idea of keeping the original sources around. I'm not a fan of keeping it on GitHub though, since if someone wants to make a code change, they'll have to pull the potentially huge data with it. I think GitHub should be exclusively for code and we push the data to S3 or Google Drive or something. We can still maintain access controls for the raw data through those means.
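As a rough illustration of the yearly aggregation mentioned above, the sketch below groups daily CSVs by year and concatenates them. It assumes the daily files are named like `2017-04-13.csv` (as in the repo path linked earlier in this thread) and that all files share one header row; the function names are hypothetical and this is not the code from the PR.

```python
import csv
import pathlib
from collections import defaultdict


def group_by_year(paths):
    """Map each year to its daily files, assuming names like 2017-04-13.csv."""
    groups = defaultdict(list)
    for path in sorted(pathlib.Path(p) for p in paths):
        groups[path.stem[:4]].append(path)
    return dict(groups)


def write_yearly(daily_paths, out_path):
    """Concatenate daily CSVs into one file, writing the header only once.

    Assumes every daily file has the same header row. Returns the number
    of data rows written.
    """
    header_written = False
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in daily_paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows += 1
    return rows
```

The resulting per-year file could then be uploaded to whichever store is chosen (data.world, S3, or similar) without touching the code repository.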

safeandfree commented 6 years ago

I really want you to do the things you've suggested. This data is valuable for so many things. It is time to really dig into it. So thank you!! Please keep posting!


laconc commented 6 years ago

@safeandfree Will do!

@fileunderjeff You mentioned that you have the data all the way back to 2015; I'd be interested in having access to it. As I mentioned above, names are replaced with a numeric identifier and we'll be reducing the granularity of the addresses. Do you have any other privacy concerns?
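To make the two privacy steps above concrete, here is one possible sketch; it is not necessarily what the PR does. It assumes a name and date of birth are available for each record, derives the numeric identifier with a keyed hash (HMAC-SHA256) so the same person gets the same ID across days, and coarsens a street address to its hundred block. The function names, the inputs, and the secret key are all hypothetical.

```python
import hashlib
import hmac
import re


def person_id(name, date_of_birth, secret_key):
    """Derive a stable numeric identifier from a name and date of birth.

    The key must be kept private: without it, the IDs cannot be
    recomputed from a list of known names.
    """
    normalized = f"{name.strip().upper()}|{date_of_birth.strip()}".encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def coarsen_address(address):
    """Round a leading house number down to its hundred block.

    Addresses without a leading number are returned stripped but
    otherwise unchanged, so they may need separate handling.
    """
    match = re.match(r"\s*(\d+)\s+(.*)", address)
    if not match:
        return address.strip()
    number, street = int(match.group(1)), match.group(2).strip()
    return f"{number // 100 * 100} block {street}"
```

For example, `coarsen_address("1234 MAIN ST")` gives `"1200 block MAIN ST"`.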

safeandfree commented 6 years ago

Can someone help me catch up on where we are with this project? There's data here that I would really, really love to look at.

laconc commented 6 years ago

Hi @safeandfree there's some work in progress but I'll try and get it done tonight.

Woodley commented 6 years ago

FBI Releases Preliminary 2017 Data on Crime in the United States https://www.justice.gov/opa/pr/fbi-releases-preliminary-2017-data-crime-united-states

djeraseit commented 6 years ago

I've run a public arrest records site (case lookup) for Harris County (Buckfumble.com) since 2011. I never understood why researchers want to obscure data that the government already publishes, citing "privacy concerns".

The approach you guys are taking is all wrong. The data is already published in CSV format; you're just storing it daily and messing it up.

What I can tell you is that Texas Penal Code 32.51 states that the presumption is fraud if you are not a company and store this information in a database.

mscarey commented 6 years ago

Thanks for reminding me of this issue, @djeraseit. I decided there was no need to leave the anonymization issue pending any longer, so I just went for the crude solution of deleting the address columns without adding IDs or any kind of substitute geodata. I'm not aware of anyone developing anything with this dataset, so I don't see a reason to add new features for hypothetical users. Anyone who needs the original data can still contact Open Austin. Note that the last day the scraper ran was June 29, 2018.

For what it's worth, the organizations that have been poking at the dataset are business entities, so I'd say the project has been neither presumptively fraudulent nor actually fraudulent.
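For reference, deleting columns from a CSV as described above can be sketched with the standard library alone. The function name is hypothetical and the column names to drop are supplied by the caller, since the dataset's real headers are not listed in this thread.

```python
import csv


def drop_columns(in_path, out_path, columns_to_drop):
    """Copy a CSV, omitting the named columns. Returns the header that was kept."""
    drop = set(columns_to_drop)
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        kept = [name for name in (reader.fieldnames or []) if name not in drop]
        # extrasaction="ignore" silently skips the dropped columns in each row.
        writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
    return kept
```

Writing to a new file rather than editing in place keeps the original available for anyone with a legitimate need for it.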

mscarey commented 5 years ago

Closing because I don't think we should onboard anyone else onto this project, but we can still do further cleanup.