Revisit data model/DB structure

Mr0grog commented 3 years ago

There are a variety of things in how we’ve modeled the data that I’m not sure I’m happy with and would appreciate some extra thought or at least gut-checking from other people on:

Locations have an id that is externally determined (with the intent that it ought to be meaningful if possible). I’m not sure this was a good idea. It gave me a short-term way to cheat and skip #48, but it’s one reason #39 involves a whole lot more code than it probably should (I think I can improve it later today when I have time, but that doesn’t change this as a more fundamental issue).
The availability table is a little complex: it’s unique by (location_id, source), so the idea is basically that we store the latest of each source for each location. The goal here was twofold:
1. Allow for a source that’s in production for a given location to operate alongside a source that’s being tested/staged for a given location (e.g. “CVS Scraper” and “CVS API” — in NJ we did this for a little while to compare their output and make sure things were working as expected with real-world data).
2. If multiple sources are producing data for the same location, leave open the possibility for consumers to choose which they want to surface if they like one but not another.
But I think (2) is more complex than is really warranted or useful for most people (and I wound up making the API just return the latest for each, so the multiple sources are kind of invisible). This also leads to queries that feel unnecessarily complex to me: https://github.com/usdigitalresponse/appointment-availability-infra/blob/c0f5c99b9dec78b6e409d43e963c08e9a6044b31/server/src/db.ts#L148-L156

CovidWA handled this by having a normal table and a “beta” table, where the beta one was for things being tested/staged that shouldn’t yet be public. Would that be a better approach here?
External IDs as a JSONB object (e.g. { "vtrcks": "CV1001240", "cvs": 1034 }). I think this still makes sense (although it needs an index for #48). Other approaches that are possible here:
- A string array (e.g. { "vtrcks:CV1001240", "cvs:1034" }) — this is probably slightly simpler/speedier in Postgres, but not sure.
- A way more complex structure with URIs. SMART Scheduling Links uses:
```
[
  {
    "system": "https://cdc.gov/vaccines/programs/vtrcks",
    "value": "CV1654321"
  }
]
```
  And Vaccinate the States has this for “links,” which is a similar-but-not-same use case:
```
[
  {
    authority: str as authority enum e.g. rite_aid or google_places,
    id: str as id used by authority to reference this clinic e.g. 4096
    uri: str as uri used by authority to reference this clinic
  },
  // ...
]
```
Address components are meant to be an object in the output (they aren’t and that’s a bug), but they are independent columns in the locations table. Should they be a JSONB column?
is_public is a boolean on multiple tables that defaults to true. Wondering if this should be the opposite (is_private DEFAULT false)

Mr0grog commented 3 years ago

Updates:

We’ll probably change id to an int to make it more clearly meaningless. This bumps up the priority on #48.
External IDs should probably become a separate table so we can store more complex data than just “system + ID”
Leave address storage as-is for now; it’s not a big deal

Other minor change: for clarity, rename the timestamps on availability.

updated_at → valid_at
checked_at → requested_at
Also considering adding changed_at for the last time the data in the row was different.

Mr0grog commented 3 years ago

High priority for the availability schema changes (useful for #52), everything else less so.

astonm commented 3 years ago

@Mr0grog which of these ideas, if any, still remain to be implemented?

Mr0grog commented 3 years ago

Just the issues around is_public and around address fields as an object, but I’m not sure either of those are really worth messing with at this point.

There are also two changes from this old doc: https://docs.google.com/document/d/1r9OHWal5oAUIsWDHiQAMQkOcGZUFEZCsTo7gxH34HlU/edit

Drop provider_locations.eligibility
Add provider_locations.timezone and ideally fill it in if not otherwise set from zip code (I've got a data file that maps these that should probably land in Postgres, but could also be in the source code).

…but I think it’s better to make separate issues to track these; this issue is a confusing mess.

Mr0grog commented 3 years ago

(See #154 & #155 for those last ones.)

usdigitalresponse / univaf

Revisit data model/DB structure #53