hackforla / ballotnav

A repository for HackforLA's BallotNav project
https://BallotNav.org
GNU General Public License v2.0

Scrape NJ data for the upcoming June Elections #382

Closed allysmatrix closed 3 years ago

allysmatrix commented 3 years ago

Overview

To gather useful data before the June election, BallotNav wants to scrape NJ data ahead of time.

Action Items

Resources/Instructions

Finalized features for BallotNav 2021

ExperimentsInHonesty commented 3 years ago

Please provide an update:

  1. Progress
  2. Blockers
  3. Availability
  4. ETA

giosce commented 3 years ago

I created a Python script to scrape the NJ State website: https://nj.gov/state/elections/vote-secure-drop-boxes.shtml With the script I created this CSV file: https://docs.google.com/spreadsheets/d/16oKiMoLT1B3kfSd-Uo5LozL23HpLLnQHYEp3a6ksq6Q/edit?usp=sharing @aNullValue can you please take a look and let me know if the format is acceptable? There are a handful of unclean entries, but they can be fixed. @aNullValue can you also let me know here where the export of the current NJ data is?

aNullValue commented 3 years ago

AFAIK, the most recent data that we have from all states other than Georgia is available at https://github.com/hackforla/ballotnav/tree/master/backend/db/states

aNullValue commented 3 years ago

Regarding your example spreadsheet:

For that info, we need to have the following columns at a minimum:

The goal is to split everything out to the maximum extent possible. The address components, in particular, must be divided into columns. Do note that if something says "Room", "Apartment", "Suite", etc., that information should be moved into an "Address (2)" or similar column -- not in "Address (1)".
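As a rough illustration of the "Address (1)" / "Address (2)" split described above, here is a minimal sketch. It is not the project's scraper: the keyword list, the `UNIT_PATTERN` name, and the `split_address` helper are all assumptions made for this example, and it only handles a unit designator that appears at the end of the street string.

```python
import re
from typing import Tuple

# Hypothetical list of secondary-unit keywords; extend it as the data requires.
UNIT_PATTERN = re.compile(
    r"[,\s]+((?:Room|Rm|Apartment|Apt|Suite|Ste|Unit|Floor|Fl)\.?\s+[\w-]+)\s*$",
    re.IGNORECASE,
)


def split_address(street: str) -> Tuple[str, str]:
    """Return (address_1, address_2) for a single street-address string.

    A trailing unit designator such as "Suite 200" goes into address_2;
    if none is found, address_2 is an empty string.
    """
    street = street.strip()
    match = UNIT_PATTERN.search(street)
    if match is None:
        return street, ""
    return street[: match.start()].rstrip(" ,"), match.group(1)
```

For example, `split_address("125 State Street, Suite 200")` yields `("125 State Street", "Suite 200")`, while an address with no unit comes back unchanged with an empty second column. Entries that put the unit first, or that spell it differently, would still need manual cleanup.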

Some states/jurisdictions provide municipality information. For the majority of states and in the majority of cases, the municipality information should be discarded, because it has no bearing on the election itself -- it's just there as a convenience for users of their site, and doesn't really have a logical place in our design. This includes -- to the best of my knowledge -- NJ. In a few states, municipality is more important than county; we instead discard county information and use only municipality information. That applies mostly to the New England area states, plus Michigan and Wisconsin.
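The county-versus-municipality rule above could be sketched roughly as follows. The state set is an assumption read off this comment (the six New England states plus Michigan and Wisconsin, which the comment itself qualifies with "mostly"), and `jurisdiction_for` is a hypothetical helper name, not something in the BallotNav codebase.

```python
# Assumption: states where municipality takes precedence over county,
# taken from the comment above (New England plus Michigan and Wisconsin).
MUNICIPALITY_STATES = {"CT", "ME", "MA", "NH", "RI", "VT", "MI", "WI"}


def jurisdiction_for(state: str, county: str, municipality: str) -> str:
    """Pick the field to keep as the jurisdiction name and discard the other."""
    if state.strip().upper() in MUNICIPALITY_STATES:
        return municipality
    return county
```

Under this sketch, an NJ row keeps its county and drops the municipality, while a Wisconsin row does the opposite.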

Note also that the above list of columns is not comprehensive for the data that we ultimately want to collect, but it's what is relevant for the data readily available from NJ.

kcoronel commented 3 years ago

@giosce please read the above message from Drew, who explains what I was trying to say on Slack much better.

Arjayellis commented 3 years ago

@giosce Please provide an update:

  1. Progress
  2. Blockers
  3. Availability
  4. ETA of completion

giosce commented 3 years ago

The spreadsheet provided by Karen https://docs.google.com/spreadsheets/d/1UzYSmz6OQ8O2PnjCrER5FpAC1C1xwdFYJQhS3XXo_nc/edit?usp=sharing corresponds to this website: https://www.state.nj.us/state/elections/vote-county-election-officials.shtml The one I scraped is https://nj.gov/state/elections/vote-secure-drop-boxes.shtml (actual boxes in the street, with no phone number or person to reach). I have uploaded the CSV to Google Drive, and I should be able to split the address as @aNullValue posted above. I'll also build a scraper for the election-officials page and upload that CSV to Google Drive. The spreadsheet that Karen shared has much more info, like phone and fax numbers. So I think the two scrapers will create two CSV files (which will also be used for continuous comparison), and we'll either merge them for the DB upload or have two imports.
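One way the "continuous comparison" of CSV files could work is to diff each new scrape against the previous one. The sketch below is only an assumption about how that might look: the `load_rows` and `diff_snapshots` helpers and the choice of key columns are hypothetical, and the caller decides which columns identify a row.

```python
import csv
import io
from typing import Dict, List, Sequence, Tuple

Key = Tuple[str, ...]


def load_rows(csv_text: str, key_fields: Sequence[str]) -> Dict[Key, dict]:
    """Index the rows of a CSV document by the values of the key columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[f].strip() for f in key_fields): row for row in reader}


def diff_snapshots(
    old_text: str, new_text: str, key_fields: Sequence[str]
) -> Tuple[List[Key], List[Key], List[Key]]:
    """Return the keys that were added, removed, or changed between scrapes."""
    old = load_rows(old_text, key_fields)
    new = load_rows(new_text, key_fields)
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return added, removed, changed
```

Calling `diff_snapshots(previous_csv, latest_csv, ["county", "address_1"])` (column names assumed here) would list locations that appeared, disappeared, or had some other field change since the last scrape. Keying on the address means a reworded address shows up as one removal plus one addition, so those cases would still need a manual look.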

giosce commented 3 years ago

Still working on scrapers for elections-officials and dropoff boxes. I hope to have both csv samples in a week or so.

giosce commented 3 years ago

I'm at a good point with scraping the "elections officials" website. This is the latest draft: https://drive.google.com/file/d/1uRu-eIWaZGTXvNXVS-qKuBsD7EIBfwDE/view?usp=sharing

Feel free to provide feedback. I know there are a handful of entries with problems; we can discuss whether it makes sense to fix them in the scraper or manually.

I started using a Python address-parsing library, and I'm now using it to scrape the "dropoff boxes" website. Hopefully I'll have a draft of this by the call next week.

The strategy I suggest is:

Let me know.

Arjayellis commented 3 years ago

Thanks, Gio. This looks good. One thing that will make this much easier is having the data consistent and all columns equivalent to the schema we currently have in our DB on the front end to minimize adjustments later on. I'll have @aNullValue weigh in before we move forward, but please review the below resources to get an idea of what I mean. We can discuss further tonight.

https://docs.google.com/spreadsheets/d/1LXkjKz7eWdh71NDrq1lYnVN4DKrZ4UfMPWFkH7h16Nk/edit#gid=1304858482 https://github.com/hackforla/ballotnav-states/issues/40

kcoronel commented 3 years ago

@jmensch1 @aNullValue can you take a look at Gio's data, thanks (as a reference to our scraped data @alligatormonday)

kcoronel commented 3 years ago

@kcoronel add screen shares of Jake and Drew's convo on slack re: NJ data