synthetichealth / synthea

Synthetic Patient Population Simulator
https://synthetichealth.github.io/synthea
Apache License 2.0
2.13k stars, 641 forks

Typo in zipcodes.csv #875

Open drwallace opened 3 years ago

drwallace commented 3 years ago

,California,CA,San Francisco,,37.727239,-123.032229

Double commas after "San Francisco".

ramdesh commented 3 years ago

Looks like there isn't a zipcode associated with that point. @dehall @jawalonoski what's the standard fix for this? I can see if I can find a zipcode.

ramdesh commented 3 years ago

Looks like this lat/long points at the Farallon Islands (https://en.wikipedia.org/wiki/Farallon_Islands). Seems to me that there is no mappable zip code for that location, in which case we should remove this line from zipcodes.csv. WDYT?

dehall commented 3 years ago

@ramdesh Thanks for doing a little digging. The short answer is yes, we should remove this line from the file.

The longer answer: The zipcodes.csv and demographics.csv files are a mishmash of files from a number of sources, joined together in ways that try to map a lot of different concepts from different jurisdictions into a simple schema. It doesn't always work perfectly, and given the size of the data it's impossible to manually curate every line. We can fix issues as we discover them, and hopefully each one helps us find a new class of fixes to make more broadly. (For example, see #768.) In general we want at least one line in zipcodes.csv for every line in demographics.csv, and each line in zipcodes.csv should have a zip code and lat/lon. It's ok to have one line with no zip code if we can't find a good source for what to put there.

In this case, since we do have 20+ real zip codes for San Francisco, we can safely delete this line without a zip code. It might be worth a deeper scan to see if there are other lines without a zip code for places where other lines do have one.
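A rough sketch of what that deeper scan could look like, in Python. The column positions are an assumption inferred from the single line quoted at the top of this issue (index, state, state abbreviation, city, zip code, lat, lon), and `find_redundant_blank_zip_rows` is a hypothetical helper, not something that exists in Synthea.

```python
import csv

# Assumed column positions, inferred from the one line quoted in this issue:
# index, state, state abbreviation, city, zip code, lat, lon.
STATE, CITY, ZIP = 1, 3, 4


def find_redundant_blank_zip_rows(rows):
    """Return rows with no zip code whose (state, city) has another row with one."""
    rows = [r for r in rows if len(r) > ZIP]  # skip short or blank rows
    places_with_zip = {(r[STATE], r[CITY]) for r in rows if r[ZIP].strip()}
    return [
        r for r in rows
        if not r[ZIP].strip() and (r[STATE], r[CITY]) in places_with_zip
    ]


def scan_file(path):
    """Run the scan over a CSV file on disk (the path is supplied by the caller)."""
    with open(path, newline="") as f:
        return find_redundant_blank_zip_rows(list(csv.reader(f)))
```

Rows for places that have no zip-coded line at all are deliberately left alone, since those are the "one line with no zip code" cases described above. Whether the flagged rows should actually be deleted is a separate call, because some of them may still be useful for spreading patients around a city.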

ramdesh commented 3 years ago

From what I can see there are around 25,285 lines in zipcodes.csv that don't have zip codes attached. This will probably need a bigger fix than deleting lines.

eedrummer commented 3 years ago

@ramdesh Maybe. You can see the previous cleaning of zipcodes.csv in #593. It references the Jupyter Notebook we used to analyze which lines in the file to keep or toss.

While these lines do not have a zip code, they should have a different lat/lon than other lines in the file for the same city. When making changes to the file I decided to keep them, since that seemed like it would give a better spread of patients around a particular city.

When looking through cities with multiple lines, I tried to filter out very obvious outliers. In places like Alaska, we previously had lines for the same city that were hundreds of miles apart. I set a simple threshold to try to get rid of the most egregious cases, but it clearly missed the instance you found for San Francisco. 30 miles off the coast may seem like a pretty big outlier, but some cities are genuinely big. Jacksonville, FL is ~36 miles across, so it's hard to come up with a hard and fast rule for the data set that would catch something like this. I may be thinking about this incorrectly, though, and am open to suggestions.
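For illustration only, here is one way a distance threshold like that could be sketched in Python. This is not the analysis from the notebook referenced in #593. It assumes the zip-coded lines for a city are trustworthy, uses their median lat/lon as a reference point, and flags other points beyond a caller-chosen distance. The function names and the choice of the median are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def flag_outliers(anchor_points, candidate_points, threshold_miles):
    """Return candidates farther than threshold_miles from the anchors' median point.

    anchor_points: (lat, lon) pairs assumed to be reliable, e.g. zip-coded lines.
    candidate_points: (lat, lon) pairs to check, e.g. lines without a zip code.
    Note: taking the median of longitudes ignores wraparound at the antimeridian.
    """
    if not anchor_points:
        return []
    center_lat = median(p[0] for p in anchor_points)
    center_lon = median(p[1] for p in anchor_points)
    return [
        p for p in candidate_points
        if haversine_miles(center_lat, center_lon, p[0], p[1]) > threshold_miles
    ]
```

The threshold is still a judgment call. A single value that catches the Farallon Islands point may also drop legitimate points in a city as spread out as Jacksonville, so a per-city or per-state threshold might be needed.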

Are you aware of other lines without zip codes that are clearly outside the bounds of a city?

ramdesh commented 3 years ago

Looks like the methods you employed were pretty reasonable. I can't think of a better way to check for remote locations like Farallon Islands right now, but I'll see if there's something we can do.