livgust / covid-vaccine-scrapers

Open-source project using Node.js and Puppeteer to scrape websites for COVID vaccine availability in Massachusetts. Can be modified to suit other areas and needs.
MIT License
66 stars, 33 forks

Duplicate entries in saved JSON #197

Closed (rjcohn closed this issue 3 years ago)

rjcohn commented 3 years ago

I've noticed a bunch of duplicates in the cached JSON, usually with not quite matching addresses. For example, in the JSON with this timestamp: 2021-03-30T141231Z

Gillette Stadium - EAST Clinic Gillette Stadium
Gillette Stadium - East 1 Patriot Pl.
Gillette Stadium - WEST Clinic Gillette Stadium
Gillette Stadium - West 1 Patriot Pl.
Hannaford (Middleborough) 8 Merchants Way
Hannaford (Middleborough) 8 Merchants Way
Marshfield Fairgrounds 61 South River St GATE E
Marshfield Fairgrounds 61 South River Street GATE E
Marshfield Fairgrounds 61 South River Street
Marshfield Fairgrounds 140 Main
Marshfield Fairgrounds 140 Main St
Randolph InterGenerational Center (RICC) 128 Pleasant Street
Randolph InterGenerational Center (RICC) 128 Pleasant St
Reggie Lewis State Track Athletic Center 1350 Tremont St.
Reggie Lewis State Track Athletic Ctr, Tremont Street, Boston, MA, USA 1350 Tremont Street
Reggie Lewis State Track Athletic Ctr, Tremont Street, Boston, MA, USA 1350 Tremont St
Saint Vincent Hospital Vaccine Collaborative @ Worcester State University - Wellness Center 525 Chandler Street
Saint Vincent Hospital Vaccine Collaborative @ Worcester State University - Wellness Center 486 Chandler Street
South Boston Community Health Center 409 W Broadway
South Boston Community Health Center 409 W Broadway
Walgreens (Mattapan) 90 River St
Walgreens (Mattapan) 90 River St
Walgreens (Pittsfield) 37 Cheshire Road
Walgreens (Pittsfield) 37 Cheshire Rd

I notice it because I have some code that generates an email listing all the sites with availability, and for Marshfield in particular I see dups. Marshfield isn't listed on the website at all (even when showing sites without availability). Are you just filtering that site out?

harcod commented 3 years ago

First, this is not unexpected.

The front-end is filtering out everything that is older than 60 minutes. I would recommend that you do the same for now. You can see a developer's eye view of this at http://covid.harcod.com/stale
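For anyone consuming the JSON directly, a client-side version of that 60-minute filter might look roughly like the sketch below. It is not the front-end's actual code: the `timestamp` field name, the ISO 8601 format, and the sample timestamps are assumptions made for illustration, so adjust them to whatever the saved JSON really contains.

```javascript
// Minimal sketch of a "drop anything older than 60 minutes" filter.
// ASSUMPTION: each entry has a `timestamp` field holding an ISO 8601 string.
const MAX_AGE_MS = 60 * 60 * 1000; // 60 minutes

function filterFresh(sites, now = Date.now(), maxAgeMs = MAX_AGE_MS) {
  return sites.filter((site) => {
    const scrapedAt = Date.parse(site.timestamp);
    // Entries with a missing or unparseable timestamp are treated as stale.
    return Number.isFinite(scrapedAt) && now - scrapedAt <= maxAgeMs;
  });
}

// Illustrative data only: the timestamps here are made up.
const sampleNow = Date.parse("2021-03-30T14:12:31Z");
const sampleSites = [
  { name: "Marshfield Fairgrounds", street: "140 Main St", timestamp: "2021-03-30T14:00:00Z" },
  { name: "Marshfield Fairgrounds", street: "140 Main", timestamp: "2021-03-29T09:00:00Z" },
];

console.log(filterFresh(sampleSites, sampleNow).map((s) => s.street));
```

Passing `now` explicitly keeps the function easy to test. In normal use you would call `filterFresh(sites)` and let it default to the current time.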

Since we don't have a database (yet), we are carrying along all historical sites in the JSON file for now. That will change soon and we can perform the "60 minute filter" on the back-end.

Many of the addresses that we get come directly from the source websites. Sometimes we have them hard-coded in a config file, but not many.
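Because the addresses arrive as written on each source site, a consumer who still sees near-duplicates after the age filter could try collapsing entries on a normalized name and street. The sketch below is one possible approach, not something the project does: the `name` and `street` field names and the suffix table are assumptions.

```javascript
// Sketch: collapse entries whose name and street differ only in case,
// punctuation, or a common street-suffix abbreviation.
// ASSUMPTION: entries have `name` and `street` string fields.
const STREET_SUFFIXES = new Map([
  ["street", "st"],
  ["road", "rd"],
  ["place", "pl"],
  ["avenue", "ave"],
]);

function addressKey(name, street) {
  const normalizedStreet = street
    .toLowerCase()
    .replace(/[.,]/g, "") // "St." -> "st"
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => STREET_SUFFIXES.get(word) ?? word)
    .join(" ");
  return `${name.trim().toLowerCase()}|${normalizedStreet}`;
}

function dedupeSites(sites) {
  const seen = new Map();
  for (const site of sites) {
    const key = addressKey(site.name, site.street);
    if (!seen.has(key)) seen.set(key, site); // keep the first occurrence
  }
  return [...seen.values()];
}

const reported = [
  { name: "Walgreens (Pittsfield)", street: "37 Cheshire Road" },
  { name: "Walgreens (Pittsfield)", street: "37 Cheshire Rd" },
  { name: "Randolph InterGenerational Center (RICC)", street: "128 Pleasant Street" },
  { name: "Randolph InterGenerational Center (RICC)", street: "128 Pleasant St" },
];

console.log(dedupeSites(reported).length);
```

This only catches spelling variants. Entries whose names differ (the Reggie Lewis variants) or whose street numbers differ (525 vs. 486 Chandler Street) remain separate, which is probably the safer default.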

rjcohn commented 3 years ago

Thanks. I'll add that filter.