ncopenpass / CampaignFinanceDataPipeline

Data Pipeline for NC Campaign Finance Dashboard
Apache License 2.0

Can we geocode transactions #8

Closed ChrisTheDBA closed 3 years ago

ChrisTheDBA commented 3 years ago

We have millions of transactions, so this won't be cheap, fast, or easy.

davidpeckham commented 3 years ago

What if we used the paid geocode service from OpenCage to get through the backlog of historical transactions, and then switched to their free service (up to 2,200 hits a day, 75K hits a month) or MapBox (up to 100K hits/month) to keep up with quarterly transaction updates? We'd need a scheduled service on our end to batch process transactions daily.

How many transactions do we already have? How many new transactions per quarter? I haven't found the dataset yet; once I do, I can answer these myself.

davidpeckham commented 3 years ago

I see that Code for America Labs is organized as a 501(c)(3), so maybe the national organization could help us get an Esri non-profit license. The non-profit license includes service credits that we could spend on their Python-friendly online bulk geocoder.

The Esri Nonprofit Organization Program (NPOP) is designed to provide qualified nonprofit organizations with an affordable way to acquire Esri solutions. All applicants undergo a review process upon submission of the NPOP application that considers formal tax-exempt status and the mission of the organization. In the United States, this is signified by your 501(c3) designation and National Taxonomy of Exempt Entities (NTEE) category. For international applicants, your organization’s eligibility and tax-exempt verification will be determined by your local Esri distributor as part of the application process.

NPOs generally considered for NPOP membership include noncommercial entities with humanitarian, conservation, and community services missions. Commercial, governmental, or organizations with a primary focus on economic development, as well as primary, secondary, or higher education institutions are not eligible for the NPOP. While educational service organizations with a specific conservation, humanitarian or community service mission may qualify, we strongly encourage universities and other higher education organizations to explore Esri’s GIS for Education programs.

davidpeckham commented 3 years ago

Here is a roundup of geocoding services. It's circa 2017, but still a good starting point.

I need to know more about the addresses we need to geocode to understand which of these services would work best (cost, capabilities):

  1. How many addresses do we already have?
  2. How many do we add each quarter?
  3. Are these residential addresses, commercial, or both?
  4. Do we need precise latitude and longitude of the address?
  5. Do we need to tie these addresses to legislative districts?

I'll try to answer (1) to (3) myself as soon as I get the dataset.

ChrisTheDBA commented 3 years ago

There are roughly 1M unique addresses for the accounts in the transactions part of the data. Bring in the voter data and we are talking 13M+ for the 2020 snapshot.

Hosted services are thus either prohibitively expensive or frustratingly slow, because the free sources limit batches to 1,000-5,000 records.
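
To put rough numbers on "frustratingly slow", here is a back-of-the-envelope sketch. It uses the free-tier caps quoted earlier in this thread (2,200/day and 75K/month for OpenCage, 100K/month for MapBox) and the address counts above. The caps are taken as stated here, not checked against current pricing, and the function name is just for illustration.

```python
def periods_to_clear(n_addresses: int, cap_per_period: int) -> int:
    """Whole periods (days or months) needed to geocode n_addresses
    when at most cap_per_period requests are allowed per period."""
    if cap_per_period <= 0:
        raise ValueError("cap_per_period must be positive")
    # Ceiling division without floating point.
    return -(-n_addresses // cap_per_period)


ACCOUNT_ADDRESSES = 1_000_000   # "roughly 1M unique addresses"
WITH_VOTER_DATA = 13_000_000    # "13M+ for the 2020 snapshot"

# Free-tier caps as quoted in this thread (assumed, not verified).
OPENCAGE_PER_DAY = 2_200
OPENCAGE_PER_MONTH = 75_000
MAPBOX_PER_MONTH = 100_000

if __name__ == "__main__":
    print(periods_to_clear(ACCOUNT_ADDRESSES, OPENCAGE_PER_DAY))    # 455 days
    print(periods_to_clear(ACCOUNT_ADDRESSES, OPENCAGE_PER_MONTH))  # 14 months
    print(periods_to_clear(WITH_VOTER_DATA, MAPBOX_PER_MONTH))      # 130 months
```

Even the account addresses alone would take over a year at the daily cap, which is the case for a local geocoder.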

We need to think of a local geocoder that can be run as part of the ingestion process like Degauss or Pelias.

Other answers: we are not under CFA's 501(c)(3) umbrella. Besides, using commercial sources (particularly ESRI and MS) is discouraged in the larger community.

  2. There is an import date that can be used as an indicator of new records. The dedupe/linkage process requires a complete purge and rebuild with the current methods, but geocoding could be limited to an appropriate subset.
  3. Both residential and commercial.
  4 & 5. We don't have specific requirements for geocoding, but designing for the most flexibility across possible uses is best: lat/long, census divisions, and legislative districts (local, federal).
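
As a rough illustration of limiting geocoding to a subset by import date, here is a minimal sketch. The record layout (`import_date` and `address` keys), the function name, and the whitespace/case normalization are all assumptions for the example, not the pipeline's actual schema or dedupe logic.

```python
from datetime import date


def addresses_to_geocode(records, since, already_geocoded=frozenset()):
    """Return unique normalized addresses imported on or after `since`
    that are not in `already_geocoded` (a set of normalized addresses)."""
    seen = set(already_geocoded)
    pending = []
    for rec in records:
        # Hypothetical field: the import date marks new records.
        if rec["import_date"] < since:
            continue
        # Crude normalization (uppercase, collapse whitespace) so trivial
        # variants of the same address are only sent to the geocoder once.
        key = " ".join(rec["address"].upper().split())
        if key and key not in seen:
            seen.add(key)
            pending.append(key)
    return pending


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {"import_date": date(2021, 1, 5), "address": "1 Main St, Raleigh NC"},
        {"import_date": date(2021, 4, 2), "address": "2 Oak Ave, Durham NC"},
        {"import_date": date(2021, 4, 3), "address": "2 OAK AVE,  DURHAM NC"},
    ]
    print(addresses_to_geocode(sample, since=date(2021, 4, 1)))
```

Passing the set of addresses geocoded in earlier runs as `already_geocoded` keeps each quarterly batch down to genuinely new addresses.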

davidpeckham commented 3 years ago

Where is local for us? Do we have cloud infrastructure for apps like Degauss or Pelias?

We need to think of a local geocoder that can be run as part of the ingestion process like Degauss or Pelias.

ChrisTheDBA commented 3 years ago

I guess it was fast, easy, and cheap. I downloaded the DeGAUSS package and processed about 1M account addresses, getting lat, long, and census block. I'm tweaking the parameters to get the most matches.

ChrisTheDBA commented 3 years ago

The DeGAUSS Docker container processed 850K records in 30-40 minutes, with a match rate above 95%.
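
For a sense of scale, here is a quick sketch that turns those figures into throughput and extrapolates to the 13M voter addresses mentioned earlier. It assumes run time scales linearly with record count, which has not been measured; the function names are illustrative.

```python
def throughput_per_second(records: int, minutes: float) -> float:
    """Average records geocoded per second over a run."""
    return records / (minutes * 60)


def projected_hours(target_records: int, records: int, minutes: float) -> float:
    """Hours to process target_records, assuming linear scaling from an
    observed run of `records` in `minutes` (an untested assumption)."""
    return target_records / records * minutes / 60


if __name__ == "__main__":
    OBSERVED = 850_000     # records in the reported run
    VOTER_DATA = 13_000_000

    for minutes in (30, 40):  # reported range of run times
        rate = throughput_per_second(OBSERVED, minutes)
        hours = projected_hours(VOTER_DATA, OBSERVED, minutes)
        print(f"{minutes} min run: {rate:.0f} records/s, "
              f"13M records in about {hours:.1f} h")
```

That works out to roughly 350-470 records per second, so the full voter snapshot would be on the order of 8-10 hours if the rate holds.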

Opening a new ticket to add a code block to the data import to process new records.