Closed ludstuen90 closed 5 years ago
The more I think about it... instead of parsing out the street number, direction, name, etc. at the database level, what if we left this work to be done after we've stored the info to the database?
That way we can scrape the data that exists no matter what.
Instead of storing street number, direction, name, and type all as separate fields, we can store:

```python
primary_address_line = models.CharField(max_length=72, blank=True)
secondary_address_line = models.CharField(max_length=72, blank=True, help_text="Apartment, Floor, Etc.")
```

and then for tax properties, store:

```python
secondary_name = models.CharField(max_length=72, blank=True)
```
*Property and owner addresses don't seem to have an extra name field... but we might discover later on that they do. In that case, we can promote `secondary_name` to be a part of all classes.*
@walinchus How does this sound? I'm trying to think about what value we'd get from parsing it out at the database level... and it seems like it might be best to do the most "destructive" actions (i.e., most manipulative of the data that's originally in the county DB) at the analytic level, and not at the DB level.
Essentially I'm just looking for a clear 👍 or 👎 re: "I would get a lot of value from transforming addresses like "19 w 25th st" into separate database fields at the database level, such that it would be:

- Street number = 19
- Street direction (E, NW, NNW, etc.) = w
- Street name = 25th
- Street type (Court, Blvd, etc.) = street"
My sense is the answer to this question is "no, I would not get a lot of value from this."
But let me know if I'm assuming incorrectly! :) @walinchus
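To make that 👍/👎 question concrete, here's a rough sketch of the kind of parsing being debated. The regex and field names are illustrative assumptions on my part, not anything from the repo, and a real parser would need a much longer street-type list:

```python
import re

# Hypothetical parser splitting "19 w 25th st" into the four proposed fields.
# The street-type alternation is deliberately tiny; this is a sketch only.
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?:(?P<direction>[NSEW]{1,3})\s+)?"
    r"(?P<name>.+?)\s+"
    r"(?P<type>st|street|ave|avenue|blvd|ct|court|dr|drive|rd|road)\.?$",
    re.IGNORECASE,
)

def parse_address(line):
    """Return a dict of address parts, or None if the line doesn't match."""
    m = ADDRESS_RE.match(line.strip())
    return m.groupdict() if m else None
```

Even this toy version shows why it's fiddly: directions, abbreviations, and suffixes all collide with legitimate street names.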
Sorry for the late response on this; I wasn't sure. My understanding is that because the order doesn't matter to Google, it shouldn't matter for mapping purposes, correct?
My hunch is that separating them out would be better, as most applications need them separately, though in R you don't need to do so.
All good, no worries! I think it depends on how we want to use it. Like you mentioned, Google is fine for the most part with unparsed addresses.
If we're designing for R and Google, then I think we're fine not to separate at DB level.
Once I have Warren county data done, we could do a test run if you like!
The more I think about it, separating them out might be the way to go, despite the complexity.
I'm thinking more about how we can identify duplicate addresses, and that will be hard if we don't parse. (One extra space could mean we create a duplicate record.)
I had initially wanted to avoid this, since I think it will take some time, but at the same time I'm starting to see how it would provide a lot of value. (If only the Google geocoding API were free for unlimited searches!)
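A lightweight middle ground for the duplicate problem, short of full parsing, might be normalizing addresses before comparing them. This helper is a sketch of that idea, not project code:

```python
import re

def normalize_address(raw):
    """Collapse runs of whitespace and lowercase the string, so that
    '19 W  25th St ' and '19 w 25th st' compare equal. This catches the
    one-extra-space duplicates mentioned above, but not abbreviation
    mismatches like 'St' vs 'Street'."""
    return re.sub(r"\s+", " ", raw.strip()).lower()
```

Comparing `normalize_address(a) == normalize_address(b)` would dedupe the whitespace/case cases without committing to a full parser.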
@walinchus Hmm. Do you think you could find out how much the non-profit credit is for the maps API?
I think there's a fork in the road here... either roll our own address parser, or query the maps API for lat/long... potentially throttling to make sure our expenses stay in check.
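If we go the maps-API route, the throttling piece could be as simple as this sketch (the function name and rate are placeholders, and real code would wrap the actual API calls):

```python
import time

def throttled(items, per_second):
    """Yield items no faster than `per_second` items each second --
    one way to keep geocoding API spend in check. Sketch only."""
    interval = 1.0 / per_second
    last = 0.0
    for item in items:
        wait = interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield item

# Usage sketch: for address in throttled(addresses, per_second=5): geocode(address)
```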
It wasn't terribly apparent from their documentation, so I will contact them today.
Okay, our credit is $250 a month.
I'm not sure which service R uses, but it looks like we have at least 100,000 requests a day.
Awesome, thanks for sending that over! This helps a lot.
It looks like the particular type of call we'd be doing is 'geocoding': https://developers.google.com/maps/documentation/geocoding/intro#GeocodingResponses

> Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739), which you can use to place markers on a map, or position the map.
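For reference, a geocoding request is a single GET against the endpoint from those docs. This sketch only builds the request URL; `YOUR_KEY` is a placeholder, and real code would also fetch the URL, parse the JSON response, and handle errors and throttling:

```python
from urllib.parse import urlencode

# Endpoint from the Geocoding API docs linked above.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(address, api_key):
    """Build the request URL for geocoding a single free-form address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})
```

Note the address goes in as one free-form string, which is part of why Google doesn't care whether we parse at the DB level.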
It looks like with the credit, we could get about 50,000 calls per month. (From what I can tell, the limits are per month, not per day.)
I see a little more than 100,000 records in Warren County... so if we went this route, we could get all of Warren County stored in 3 months (and still have some room left over).
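A back-of-envelope check of that estimate (the 110,000 record count is my assumption for "a little more than 100,000"):

```python
import math

monthly_credit_usd = 250       # non-profit credit per month
cost_per_call_usd = 0.005      # geocoding price per query (first tier)
records = 110_000              # assumed Warren County record count

calls_per_month = round(monthly_credit_usd / cost_per_call_usd)  # 50,000
months = math.ceil(records / calls_per_month)                    # 3
```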
If we stored them in the DB, too, then we wouldn't need to make additional calls for reporting.
How does this land with you?
Hmm. You're right, Google says "per month" whereas the R documentation says per day. I'm wondering if this is an R thing or if the docs are just outdated. I will look into this and find out.
Right on, sounds good - thanks for doing that!
G Suite Support said this question is "outside the scope of their support offering" and bounced me to the Google Maps & Earth help forum, which unfortunately was no help at all.
I asked the NICAR listserv, though, and they said they thought the ggmap documentation is outdated, as it was released before Google updated their pricing.
It looks like, based on this: https://developers.google.com/maps/billing/understanding-cost-of-use?hl=en_US#geocoding that the first 100,000 queries are $0.005 each and the next 400,000 queries (up to 500,000) are $0.004 each.
So I think we will limit geocoding API queries to visualization purposes, as that is the biggest bang for our buck in a journalistic sense. For example, mapping where LLC sales are, or where land bank properties are, etc.
OK awesome, sounds good! Thanks so much for doing the legwork on this! My read on this then is just that we'll want addresses stored in a way such that we can easily search with them.
i.e.: Something like "owner name" / "19 w 25th St" all in one line would be fine, rather than needing to separate to the level of:

- Street number = 19
- Street direction (E, NW, NNW, etc.) = w
- Street name = 25th
- Street type (Court, Blvd, etc.) = street
If this sounds good to you, would you mind closing this issue? @walinchus
Oh sorry. Yes sure.
Investigate whether we are enforcing too strict an address scheme by requiring each address component to be stored separately...