data_reader.py now only queries distinct patids from the private_address_history table

It's a nice clean diff but unfortunately I don't think it's going to be enough. First I'm a little surprised that when we do a distinct on patid, that we get other columns back. Normally in SQL "select distinct(x) from ..." means you get only the distinct values of x and not any other columns. But I have two other concerns:

If I'm reading this right we're using this function: https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.distinct Which says:

optional column expressions. When present, the PostgreSQL dialect will render a DISTINCT ON () construct. Deprecated since version 1.4: Using *expr in other dialects is deprecated and will raise CompileError in a future version. So it works in postgres but apparently no other DBs. At least one of our partners is using Oracle, and I think some others might be using MS SQL? But we can't assume everyone is using postgres.
If an individual DOES have multiple rows, I'm not sure which one this is going to select. There aren't super detailed database constraints on the table, meaning a really dedicated user can probably always do something that will cause us to pick the wrong one or have to pick randomly, but we should at least try to pick one intelligently if we can. At minimum we should filter by address_preferred = 'Y'. (And double check me on the schema there, I think it's 1 char Y or N but I may be misremembering). There's also a set of date range columns where we could select by whichever row is active.

mitre / data-owner-tools

data_reader.py now only queries distinct patids from the private_address_history table #42