Turn daily arrest data posted by Harris County into a real database

safeandfree commented 8 years ago

The repo for this project idea is here

I think this is easy, but it is over my head. :)

There is a daily data file posted in Houston here http://www.jims.hctx.net/jimshome/jimsreports/jims1058.txt

This is 24 hours of arrests. The file is replaced every day. We are trying to show that Harris County's jail is full of people they didn't have to arrest.

We pulled it and yesterday in Houston:
· A guy was arrested for a bad headlight, no other charge
· A guy was arrested for a stop sign violation, no other charge
· Eleven people were arrested for poss less than a gram, pg 1 and no other offense (keeping in mind, 1/3rd or more of these are likely to be bad field test victims)
· Two people were jailed for theft less than $50
· A woman jailed for a series of things that just look like “piling on” – unclean license plate (that’s a thing?), cardboard over a car window, no insurance
· Two guys were arrested for evading with no additional charge
· A guy was arrested for evading with a vehicle with no other charge

The list goes on.

And now for the help I need.

I want to automate a daily download of this dataset. I can create an automated task in Access that will grab a file from my desktop and load it to Access.

That's it. I just want to grab this file every day and not have to do it manually. I hope that's pretty simple. I would suggest including it in your data portal project, but it's Harris County data, so not local. It is really valuable for criminal justice research.

werdnanoslen commented 8 years ago

Is the filename always "jims1058.txt"? If so, you could use python's urllib.urlretrieve or wget running as a cron job.

If the file name changes, what is the pattern? I presume the 1058 in "jims1058.txt" changes depending on the date? Once the pattern is known, that can be used to change what url the download script points to.
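For anyone following along, here is a minimal sketch of the first option in Python 3, where `urlretrieve` lives in `urllib.request`. The function name `download_report`, the `arrests` folder, and the date-stamped file name are assumptions made for illustration, not anything agreed in this thread. The date stamp is only there so that each day's file does not overwrite the previous one.

```python
import datetime
import pathlib
import urllib.request

# URL taken from the original post; the county replaces this file every day.
REPORT_URL = "http://www.jims.hctx.net/jimshome/jimsreports/jims1058.txt"


def download_report(url=REPORT_URL, dest_dir="arrests", today=None):
    """Save one day's report as <dest_dir>/jims1058-YYYY-MM-DD.txt.

    The folder name and naming scheme are assumptions for this sketch.
    """
    today = today or datetime.date.today()
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"jims1058-{today.isoformat()}.txt"
    # urlretrieve copies whatever the URL returns into the target path.
    urllib.request.urlretrieve(url, str(target))
    return target
```

Calling `download_report()` once a day from a scheduler (cron on Linux or macOS, Task Scheduler on Windows) would cover the "every day without doing it manually" part.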

safeandfree commented 8 years ago

The filename is always jims1058.txt. I will look at these ideas. I’m not a programmer but I’m not afraid to muck around a little. ☺

safeandfree commented 8 years ago

OK I officially don’t understand. ☺ I have looked at both options, and in my complete ignorance I like the urllib.urlretrieve option…but now what?

Woodley commented 8 years ago

FYI, I noticed this htm version that is easier to read with page breaks. http://www.jims.hctx.net/jimshome/jimsreports/jims1058.htm

But, I understand you just want the raw data to merge into your database.

There is also a report of bookings and releases within the last 24 hours. This one does not have a jims503.txt version: http://www.jims.hctx.net/jimshome/jimsreports/jims503.htm

These reports do not have context. I also notice that the arrest dates and booking dates may be off by a few days. I think that the booking date is what these reports are based on. I also noticed that some people have different arrest dates for the same person. So is it possible to get lost in the system if you get arrested but not booked?

-John

On Tue, Jul 19, 2016 at 10:21 AM, safeandfree notifications@github.com wrote:

OK I officially don’t understand. ☺ I have looked at both options, and in my complete ignorance I like the urllib.urlretrieve option…but now what?

Kathy Mitchell | Texas Criminal Justice Coalition | Grassroots Sentencing Campaign Coordinator | 1714 Fortview Road, Suite 104, Austin, Texas 78704 | Office: (512) 441-8123 Ext. 116 | Fax: (512) 441-4884 | www.TexasCJC.org | www.facebook.com/TexasCJC | www.twitter.com/TexasCJC

TCJC works with peers, policy-makers, practitioners, and community members to identify and promote smart justice policies that safely reduce Texas’ costly over-reliance on incarceration – creating stronger families, less taxpayer waste, and safer communities. DONATE TODAY! https://co.clickandpledge.com/sp/d1/default.aspx?wid=45713

colbywhite commented 8 years ago

@werdnanoslen The URL is the same. I was looking at it last night and again this morning. Take a look at the http://www.jims.hctx.net/jimshome/jimsreports/ directory. It seems that jims1058.txt gets updated every day around 2:30am. I already have a Ruby class that pulls jims1058.txt, parses it (using the native CSV lib if you're wondering), and holds the information. That part isn't too difficult. The next step would be to dump it into a DB of some sort.

EDIT: @safeandfree, you might also want to look at http://www.jims.hctx.net/jimshome/jimsreports/. As @Woodley is pointing out, there might be some other data you might find interesting. Although, based on the timestamps, I think jims1058.txt is the only one that is getting updated daily.

colbywhite commented 8 years ago

In addition to putting the info in a DB, which will allow you to search and query against the info, you probably want to save the actual file somewhere for reference later (S3 maybe?). That just allows you to rebuild the DB from scratch if you want/need.

Also, just for the sake of having it recorded since it took a while for me to make the connection last night, JIMS stands for Justice Information and Management System. (It's not a guy named Jim 😉.) I assume the 1058 part of the file name is a reference to something. A law? A form? From the technical aspect, that doesn't change anything, I guess. Just context.

safeandfree commented 8 years ago

Fascinating. I didn’t notice the gap between arrest and booking. Yes, people have gotten “lost” although it is supposedly rare. Unless the police officer is just driving around with the arrestee in his car for days, the person is probably stuck at some point in the booking process. These kinds of questions are among the many, many things this data will start to allow us to investigate.

Kathy Mitchell

Woodley commented 8 years ago

@colbywhite, speaking of JIMS, that is making me hungry for Jim's food.

@safeandfree I think by law they can hold you for 48 hours before charging or releasing you.

werdnanoslen commented 8 years ago

@colbywhite I think @safeandfree said they can load the files as-is into Access, since it seems to be a database format delimited by semicolons and tabs.

@safeandfree have you used Ruby before? What is the OS on the computer that you'd like the files to be downloaded to?

colbywhite commented 8 years ago

Ah, I see. Didn't realize that was in an Access format. Sweet. That makes it even easier then. I'm not familiar with Access, but I assume it has a remote import feature? A cron job that just shoves the file into Access should suffice? Doesn't matter which language then. Whichever has the better library for importing into Access I guess.

safeandfree commented 8 years ago

Access does have a remote import feature. I tried to use it to go directly to that page and got all sorts of errors. ☺

safeandfree commented 8 years ago

I hope you are getting these replies. I should go into git…

Yes, I’m at a small nonprofit with the basic Microsoft tools on a Windows 7 operating system. I can work on something that is hosted remotely too.

colbywhite commented 8 years ago

@safeandfree, if you're able to get a hosted Access instance, then that would definitely make this even simpler. Load up the file, shoot it into your Access instance. I, unfortunately, have no experience with hosted Access instances. So I wouldn't know where to look. Somebody else have input on that?

Assuming you get that instance, would you want some kind of website on top of it in order to query it? Or do you plan on just querying the Access DB directly to get what you need out of it?

mateoclarke commented 8 years ago

So if it's important for the data to be accessible from the desktop MS Access application on Kathy's machine, we might be able to run a Microsoft Azure SQL database and configure her Access app to pull data from the Azure cloud.

Linking Access Applications to SQL Server - Azure SQL DB
Office Support: Link to SQL Server data

We have credits, hard to decipher how much but I think like $130 worth of credits. Their basic plan is $5 a month for 2GB, then $15 for 250GB, so we would have credits to get us through 1-2 years depending on the size of these files and could probably ask for more from contacts at Microsoft that know Code for America. I was expecting to find an Access in the Cloud type of service but that is probably part of Office 365, not Azure.

This is what the Azure SQL web UI looks like

Alternatively, we could run a Microsoft SQL Server on AWS where we don't have any known credits caps. Microsoft SQL Server on Amazon RDS

Looping in @luqmaan & @gusIreland who manage our hosting resources.

safeandfree commented 8 years ago

This sounds awesome!

werdnanoslen commented 8 years ago

After the Azure credits run out, I will see what I can do about providing access to Bluemix, which I will likely be working on at IBM.

mateoclarke commented 8 years ago

yeah, good point. An Azure SQL database should be simple to migrate to whatever hosting platform necessary. (If it isn't simple we shouldn't use it.)

luqmaan commented 8 years ago

I don't think this project needs a database.

A database is a lot of work to set up, is not open, and is not easy to access.

A simpler and more open solution is to do what we did with the construction-permits project (https://github.com/open-austin/construction-permits):

A script runs once a day
The script calls http://www.jims.hctx.net/jimshome/jimsreports/jims1058.txt and converts it to CSV
The script stores the CSV file on GitHub: https://github.com/open-austin/construction-permits/blob/master/permits/github.py

The data is searchable: https://github.com/open-austin/construction-permits/search?utf8=%E2%9C%93&q=7east

The data is browsable: https://github.com/open-austin/construction-permits/tree/master/data

Because the data is in CSV format:

it's easy to load into ElasticSearch or some other database if we need fancier searching.
it's easy to use with pandas, d3, and other tools
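The "converts it to CSV" step above could look roughly like the sketch below. It assumes the report is plain delimited text and that `;` is the field separator. The thread only says the file "seems to be" delimited by semicolons and tabs, so the delimiter should be checked against a real jims1058.txt. The name `report_to_csv` is made up for this example.

```python
import csv
import io


def report_to_csv(raw_text, delimiter=";"):
    """Re-emit delimited report text as comma-separated CSV text.

    The delimiter is an assumption; verify it against the real file.
    """
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Skip blank lines and trim padding around each field.
        if any(field.strip() for field in row):
            writer.writerow([field.strip() for field in row])
    return out.getvalue()
```

Going through the `csv` module rather than a plain string replace means fields that contain commas (names written as "LAST, FIRST", for instance) get quoted properly in the output.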

safeandfree commented 8 years ago

I totally want this data to be someplace where people can get it for lots of purposes. A primary purpose for me is to be able to surface patterns and ID people who are the victim of patterns of over policing. Yesterday, I took several days of the data, merged it into a single data file, and then ran cross tabs in order to ID people who were arrested on a single class C misdemeanor charge so I could write them a letter asking for more information about what happened and why they were actually jailed on an offense that does not have jail as an available punishment (Class C misdemeanors are "fine-only" offenses). A search function for everyone in the file with a class C misdemeanor would pull up a huge number of people who were also charged with other offenses in the same arrest. This file has a separate row for every charge, so the same person might be in there eight times if they were arrested on eight charges. Does that information change the analysis?
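Because the file has one row per charge, the single-charge question comes down to grouping rows by person and keeping only the groups that contain exactly one row. Here is a rough sketch of that idea. The field names (`name`, `dob`, `level`) and the `"MC"` code in the usage note are hypothetical, since the real column headers are not given in this thread.

```python
from collections import defaultdict


def sole_charge_rows(rows, id_fields, keep):
    """Return rows for people who appear exactly once and whose only
    charge satisfies keep(row).

    rows      -- iterable of dicts, one per charge (e.g. from csv.DictReader)
    id_fields -- column names that together identify one person/arrest
    keep      -- function(row) -> bool, e.g. "is this a Class C misdemeanor?"
    """
    charges = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in id_fields)
        charges[key].append(row)
    # A person with several charges has several rows and is dropped here.
    return [group[0] for group in charges.values()
            if len(group) == 1 and keep(group[0])]
```

With those hypothetical columns, `sole_charge_rows(rows, ("name", "dob"), lambda r: r["level"] == "MC")` would return only the people whose single row is a Class C charge. Once several days are merged into one file, the arrest or booking date probably belongs in `id_fields` too, so that two separate arrests of the same person are not counted as one arrest with two charges.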

mateoclarke commented 8 years ago

I know this data is already published online, but how do we feel about dumping a huge list of names of people who have been booked, along with their birthdays, into a publicly viewable GitHub repo as CSVs? Is there a concern for their privacy? This could be one argument for why we would want to store it on a DB and only give ppl access if they request it.

If we're ok with storing this data publicly, then it seems like writing data through the Github API is a good first step. And from there maybe we will discover functionality that requires a real SQL db.

safeandfree commented 8 years ago

Not only is this data public already, but there are some unsavory web companies that actually post it already. I'm not sure we're making people's privacy any worse by putting this on git for now. In the long run, this is likely to be a big issue at the legislature this session because, yes, it's a problem from a privacy standpoint.

colbywhite commented 8 years ago

@luqmaan, I agree if we can get away with not setting up a database, we should. But do you think that kind of search functionality fulfills the type of searching @safeandfree is looking for? For instance, using your construction permit example, I can search for how many Sign Permits have been issued (5,866), but I can't figure out how many Sign Permits were issued for Spicewood Springs using the github search alone. I also can't figure out how many Sign Permits have been issued since 2000 using the github search alone. I can't compare the amount of Sign Permits in 2000 to 1999.

To figure those out, I would have to download all the CSVs and load them into my own database. So my question goes to @safeandfree, is that a valid solution? Could we set something up to start downloading the jims1058.txt files and storing them here? And from there you can load them into whichever personal DB you decide is necessary. In many ways, that leaves you in a position similar to the one you are in now, except now you would have some historical data, as opposed to just one day's worth. You also wouldn't have to update your DB every day. Since the jims1058.txt file is being stored here every day, you can feel sure that you're not missing a day and just update your personal DB whenever you decide you want new data. (Maybe we can include a script or two to make that easier.)
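As a rough idea of what "a script or two" could look like, the sketch below loads stored CSV files into a local SQLite database. SQLite stands in here for whichever personal DB is actually used (Access would import the CSVs through its own tools). The table name `charges` and the assumption that every CSV starts with a header row are illustrative only.

```python
import csv
import sqlite3


def load_csv(conn, csv_path, table="charges"):
    """Append one CSV file (with a header row) to a SQLite table.

    The table is created from the header on first use; all columns are
    left untyped, which SQLite allows.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        # Every remaining row must have as many fields as the header.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
```

After running `load_csv` over every stored file, questions like "how many rows had this charge since a given date" become ordinary SQL queries against one table.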

You know, as I type that out, that solution is growing on me. It does put some of the burden on the person looking to make in-depth conclusions based on the data - i.e. @safeandfree.

As for the privacy question, I think you guys would be better equipped to answer that than me. But I would point out that, using that construction permits repo, I was able to surmise that two people named Larry Butler and Carol Ann Sayle remodeled a home on Lyons road in 1980. So I think you guys have already staked out a position somewhere on the privacy spectrum. This would just seem to follow that position.

safeandfree commented 8 years ago

Hey all, yes to just getting a script to start pulling the data every day without me having to remember. That would be AWESOME.

But also, is there a way to have that script add the new file every day to one spreadsheet instead of making separate files? Ideally I would like to have a year of data slowly accumulate. When it's a daily file, it gets a bit rough to manually merge them all together.
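One way the accumulating file could work is sketched below: each day's CSV is appended to a single combined CSV, and the header is written only once. This assumes each daily file starts with a header row. The extra `report_date` column and the function name `append_daily` are inventions for this example, added so that rows from different days can still be told apart after merging.

```python
import csv
import os


def append_daily(daily_csv, combined_csv, report_date):
    """Append one day's rows to the combined file, tagging each with a date."""
    is_new = (not os.path.exists(combined_csv)
              or os.path.getsize(combined_csv) == 0)
    with open(daily_csv, newline="") as src, \
            open(combined_csv, "a", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return  # empty daily file, nothing to add
        if is_new:
            # Write the header once, with an extra column for the file date.
            writer.writerow(["report_date"] + header)
        for row in reader:
            writer.writerow([report_date] + row)
```

Run right after the daily download, for example `append_daily("today.csv", "all_arrests.csv", "2016-07-21")`, this would let a year of data build up in one file that opens directly in a spreadsheet or imports into a desktop database.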

And finally, yes, if we can just get it into a big file that can be downloaded as .csv I can upload it to a personal database on my desktop. Because I do need to do complex things. Today for example I needed to know how many people who were arrested for evading arrest had no other related charge. So they were evading arrest for what exactly? I also want to be able to eventually map where people live. Are people who are arrested for evading arrest (with no other charge) disproportionately from certain neighborhoods?

This is very rich data and is going to reveal a great deal about the front end of policing that no one actually knows now. Or, well, some people know it very well, but not lawmakers or city council members.

Woodley commented 8 years ago

I am not an attorney and nothing in this can be construed as legal advice. Having said that...

If you are concerned about privacy you should contact a lawyer. You could also contact the Attorney General's Office to see what they think about privacy. The following are publicly available resources:

GOVERNMENT CODE TITLE 5. OPEN GOVERNMENT; ETHICS SUBTITLE A. OPEN GOVERNMENT CHAPTER 552. PUBLIC INFORMATION SUBCHAPTER A. GENERAL PROVISIONS http://www.statutes.legis.state.tx.us/Docs/GV/htm/GV.552.htm

Texas Attorney General - Public Information Act Handbook https://www.texasattorneygeneral.gov/files/og/publicinfo_hb.pdf

http://www.open-public-records.com/texas_public_records.htm

safeandfree commented 8 years ago

Legally, this is completely public information. That maybe needs to change, and there will be some discussion during the next legislative session about the privacy rights of people who have been arrested, booked and charged with crimes but are not yet “guilty” because they are pre-trial. For now, there is a significant research benefit to making this data available to the criminal justice reform movement so we can study things like arrests for offenses where jail time is not a punishment, arrests for offenses like “evading arrest” which make no sense as standalone charges, or arrests for low-level drug possession offenses (which are based on unreliable field tests). And there is a significant organizing component for the movement as well. I am actually contacting people so they can vouch for their experience in the political process.

Hopefully, all that helps get us to the best technical solution?

mateoclarke commented 8 years ago

@colbywhite, Is it cool if I go ahead and create a repo for this under our github org, "open-austin" and set you up as admin?

What should we name the repo? open-austin/jims Or something more descriptive? open-austin/harris-county-bookings

colbywhite commented 8 years ago

open-austin/harris-county-bookings is good.

FYI: A quick, casual, less-than-five-minutes Google search indicates that there may be some other counties using JIMS (Knox, Tennessee seems to use something like it), but for now, we're focused on Harris County. If someone from another county/city wants our help with it, then we can make a new repo with the common JIMS code. But open-austin/harris-county-bookings is definitely good for now.

mateoclarke commented 8 years ago

https://github.com/open-austin/harris-county-bookings

luqmaan commented 8 years ago

Should we consider modifying some of the columns that personally identify people?

Change name to a number
Change birth date to birth year
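A rough sketch of those two changes, in Python. The column names `NAME` and `DATE OF BIRTH`, the output names `PERSON ID` and `BIRTH YEAR`, and the MM/DD/YYYY date format are all assumptions; the real JIMS headers may differ.

```python
# Hypothetical column names; check them against the actual file header.
NAME_COL = "NAME"
DOB_COL = "DATE OF BIRTH"


def pseudonymize(rows):
    """Replace each name with an arbitrary number and each birth date with its year."""
    ids = {}
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the caller's rows are left untouched
        name = row.pop(NAME_COL)
        dob = row.pop(DOB_COL)
        # The same (name, birth date) pair always maps to the same number within a run.
        row["PERSON ID"] = ids.setdefault((name, dob), len(ids) + 1)
        # Assumes MM/DD/YYYY, so the year is the last four characters.
        row["BIRTH YEAR"] = dob.strip()[-4:]
        cleaned.append(row)
    return cleaned
```

Note that the numbers are only consistent within a single call; keeping them stable across days would need a persisted mapping, which would itself have to stay private.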

werdnanoslen commented 8 years ago

Perhaps just for this repo, leave off the unnecessary columns for development, then add them back for production (safeandfree's computer). Just so that we don't improve the SEO of someone's records.

fileunderjeff commented 8 years ago

I have been scraping this data since May 2015. Happy to make it available, provided there are privacy safeguards for the names.

luqmaan commented 8 years ago

Nice @fileunderjeff!

Do you mind turning your code and data into a repo? Or opening a PR to https://github.com/open-austin/harris-county-bookings? Whichever one works best for you.

fileunderjeff commented 8 years ago

@luqmaan no problem! Let me confer with some local attorneys first, but I am happy to put together the database. Right now, my scraper is pretty rudimentary. I'd like to work on it a little more, and maybe build an API. Stay tuned!

luqmaan commented 8 years ago

Excellent.

Before you do a bunch of work to design and build an API, let's keep things simple: just a bunch of CSV files in a GitHub repo.

CSV files in a repo have a bunch of advantages over an API, specifically:

easy to browse https://github.com/open-austin/construction-permits/tree/master/data
easy to export/download
easy to use in pandas
no need to learn how to use an API
no need to report/fix bugs in an API
can be forked
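To illustrate the "easy to use" point, here is a small sketch that reads every daily CSV under a data directory using only the Python standard library. The one-file-per-day layout mirrors the construction-permits repo; the function name and the `source_file` key are made up for the example.

```python
import csv
import pathlib


def load_daily_csvs(data_dir):
    """Yield every row from data_dir/**/*.csv, tagged with the file it came from."""
    for path in sorted(pathlib.Path(data_dir).rglob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name
                yield row
```

With pandas installed, roughly the same thing can be done by calling `pandas.read_csv` on each file and combining the results with `pandas.concat`.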

fileunderjeff commented 8 years ago

@luqmaan I am with you on the ease of a repo, but I am not going to release the raw files without consulting a lawyer first. This is something we are already working on for 2 other projects in Houston.

safeandfree commented 8 years ago

WOW!! Can you give me a call?

Kathy Mitchell Texas Criminal Justice Coalition | Grassroots Sentencing Campaign Coordinator 1714 Fortview Road, Suite 104 Austin, Texas 78704 Office: (512) 441-8123 Ext. 116 Fax: (512) 441-4884 www.TexasCJC.org | www.facebook.com/TexasCJC | www.twitter.com/TexasCJC

TCJC works with peers, policy-makers, practitioners, and community members to identify and promote smart justice policies that safely reduce Texas’ costly over-reliance on incarceration – creating stronger families, less taxpayer waste, and safer communities. DONATE TODAY! https://co.clickandpledge.com/sp/d1/default.aspx?wid=45713


colbywhite commented 8 years ago

I was able to carve away some time to do the initial cut. Will work on committing the file next. Shall we close this ticket for now and move the convo over?

As far as the privacy goes, I'm just going to follow in the permit repo's footsteps until a decision is made on how Open Austin wants to handle privacy in these sorts of situations. But I am eager to hear what y'all come up with. (And maybe that convo should be split into a different ticket as well?)

luqmaan commented 8 years ago

@colbywhite No, we should keep the issue open. I think there's still some discussion going on.

Also, we still need to figure out if @fileunderjeff is opening a PR to add the data he's already collected to https://github.com/open-austin/harris-county-bookings or if he'll be creating his own repo.

fileunderjeff commented 8 years ago

@colbywhite @luqmaan I'll be creating a repo out of Sketch City, but only after I talk to a lawyer. I urge you all to consider the privacy issues at stake here. Arrests are not adjudications. They can be expunged, the person can be found not guilty, etc. The JIMS file also has a ton of personally identifying information that needs to be reviewed by a lawyer prior to publishing. So it is not happening overnight. Thank you for your patience.

fileunderjeff commented 8 years ago

@luqmaan also happy to initiate a PR from Open Austin. Doesn't matter to me.

fileunderjeff commented 8 years ago

Other things we are planning to do with this data (in case anyone wants to join in!):

Assess the accuracy of "white" and "hispanic" classifiers by running names through http://www.textmap.com/ethnicity/
Look at arrests where possession of marijuana 0-2oz is the prompting charge, and assess the FTE hours associated with processing these arrests.

This data is really interesting and there's a lot that can be done with it.

mateoclarke commented 8 years ago

@fileunderjeff, Have you used data.world yet? We have invites we can share. That might be a good place to host the data once personal identifiers are removed.

You can host a dataset for free, public or private, just like GitHub. You can control who has access to view, query, and download the data. Maybe that's a solution to the sharing question. They don't have a public API, so uploading would be a manual process, but it sounds like the data you have already collected would give Kathy a start on research.

I take the privacy question seriously.

A situation I want to avoid is one where it's harder for someone to get a job because:

an employer can more easily search via Full Names and DOB through booking records and create prejudice against somebody without evidence of guilt.

Each row in this DB represents a fragile moment in time when a person lost control of their liberty to the State before being able to defend their innocence. For good reasons, this data is already public. For bad reasons, it already perpetuates inequalities in our justice system.

do no harm...

colbywhite commented 8 years ago

FYI: The work in open-austin/harris-county-bookings is ready to be deployed. Then this could be closed, correct?

safeandfree commented 8 years ago

Hey all... I have somehow missed the updates here. Is this the work that Jeff in Houston has been doing or is this the work we started (and completed?) here at Open Austin? I got a little confused about who was on first. 😊


colbywhite commented 8 years ago

@safeandfree I completed the work from an Open Austin perspective a while ago. I'm just waiting on some credential information so I can deploy it and start it up. You'll be able to see the data in the open-austin/harris-county-bookings repo's data directory when it's running.

And it sounded like @fileunderjeff will be doing his work under the @sketch-city group, but getting a lawyer's opinion first. The work I did scrubs the personal data out of what is kept.

mscarey commented 8 years ago

I haven't looked at this project before and I don't really know how it works, but it doesn't look to me like it's handling privacy correctly yet. The file names with the .accdb extension still have the arrestees' names. I can understand collecting that data, but I don't think Github is the right place to store it. The .csv files have no names, but they still have addresses, which are also personally identifying. It looks to me like these are home addresses, not the address where the arrest happened (the arrestee I'm looking at was probably not inside an apartment when she got caught for driving without a license, yet the address field for the arrest specifies a unit of an apartment building).

My suggestion is to not store the .accdb files on Github at all, and drop at least these additional columns from the .csv files: ADDRESS NUMBER, ADDRESS PREFIX, ADDRESS STREET, ADDRESS SUFFIX, ADDRESS ALI. You might think about generating arbitrary ID numbers corresponding to the arrested person, or some other solution to make it clear whether a bunch of people are being arrested or one person is being arrested for numerous crimes. But what's important in the short term is to fix the privacy issue.
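As a sketch of the column-dropping part, assuming the published files are plain CSVs whose headers match the names listed above (the function name is made up for the example):

```python
import csv

# Column names as listed in the suggestion above.
ADDRESS_COLUMNS = {
    "ADDRESS NUMBER",
    "ADDRESS PREFIX",
    "ADDRESS STREET",
    "ADDRESS SUFFIX",
    "ADDRESS ALI",
}


def drop_columns(src, dst, unwanted=ADDRESS_COLUMNS):
    """Copy CSV data from the open file src to dst, leaving out the unwanted columns."""
    reader = csv.DictReader(src)
    kept = [name for name in reader.fieldnames if name not in unwanted]
    # extrasaction="ignore" makes the writer silently skip the dropped columns.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
```

Filtering by an explicit list of unwanted columns means any new identifying column the county adds later would pass through; an allow-list of columns to keep would be the more cautious variant.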

colbywhite commented 8 years ago

I think what you're seeing regarding the difference between the accdb and the csv is a bug. I'll investigate that further during the hack night this week. Not sure why those are being treated differently. Good catch.

In regards to the addresses, those just weren't on the list of things to scrub in open-austin/harris-county-bookings#4. But you're probably right. Those should probably be scrubbed as well. I like the arbitrary ID number idea as well.

mateoclarke commented 8 years ago

Thinking back to the problem that @safeandfree is trying to solve...

A primary purpose for me is to be able to surface patterns and ID people who are the victim of patterns of over policing. Yesterday, I took several days of the data, merged it into a single data file, and then ran cross tabs in order to ID people who were arrested on a single class C misdemeanor charge so I could write them a letter asking for more information about what happened and why they were actually jailed on an offense that does not have jail as an available punishment (Class C misdemeanors are "fine-only" offenses).
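For reference, the cross tab described there could look something like this in Python. This is only a sketch: `PERSON ID` and `CHARGE LEVEL` are hypothetical column names, and `MC` is an assumed code for a Class C misdemeanor, so all three would need to be matched to the real data.

```python
from collections import defaultdict


def single_class_c(rows, person_col="PERSON ID", level_col="CHARGE LEVEL", class_c="MC"):
    """Return the IDs of people whose only charge in rows is a Class C misdemeanor."""
    charges = defaultdict(list)
    for row in rows:
        charges[row[person_col]].append(row[level_col])
    # Keep only people with exactly one charge, and that charge is Class C.
    return sorted(person for person, levels in charges.items() if levels == [class_c])
```

Because it groups by person across all rows passed in, it works the same on one day of data or on several days merged together.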

I think getting @safeandfree set up with the full unabridged dataset to use either locally on her computer, or a privately accessible server (like the Azure SQL db) is what is needed.

@safeandfree will you be at the Civic Hack Night tomorrow? Just want to make sure what @colbywhite has been working on gets you a solution that works for what you are trying to accomplish. And thanks to @mscarey for speaking up about privacy concerns. The more we scrub what is published on GitHub, the more I think we might need to consider a separate solution to address @safeandfree's needs.

Mateo

safeandfree commented 8 years ago

Yes, Mateo is right about what I really need from this.

Yes, I will be at Civic Hack tomorrow evening. Sorry to have been a bit out of pocket. SO much going on.


decibel commented 8 years ago

Useful background about PII: https://en.wikipedia.org/wiki/Personally_identifiable_information

I'm not an expert on PII (though I've worked on databases for 30 years), so I'm not sure if arrest records are exempted.

I will say that simply hiding the fact that the county is releasing all of this information is a big disservice to the community. While this project doesn't necessarily have to expose the same data that the county does, I think it should let visitors know what information the county provides that is not being exposed. People have a right to know what information is being made public.

Please understand I don't have an axe to grind with Harris County. They should be following appropriate laws about release of information (which hopefully they are). In either case, part of open government should be making it clear to citizens what data is out there.
