opensafely / emis-qa


Investigate and describe total patient counts #1

Open sebbacon opened 3 years ago

sebbacon commented 3 years ago

The studied population contains about 1,493,041 fewer patients than EMIS reported to NHSD in January 2021, per the following:

  1. We think EMIS + TPP are 97% of the population, but that is necessarily an estimate
  2. NHSD say that in January 2021 there were 60,6580,000 patients registered at a GP practice
    1. So we'd expect our total to be 58,800,000 (~97% of NHSD total)
  3. EMIS have told us they reported 35,235,000 to NHSD in January 2021
  4. We counted 57.8 million as of the date of the last run
    1. Of these we counted 33,700,000 in EMIS
  5. Therefore our total England counts are ~1m patients short of what is expected, and the EMIS count is 1.5m patients short of what's expected
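The arithmetic behind point 5 can be checked with a short sketch. It uses only the rounded figures quoted in the list above; the variable names are illustrative and not taken from any OpenSAFELY code. Because the inputs are rounded, the EMIS result differs slightly from the precise headline figure of 1,493,041.

```python
# Rounded figures quoted in the list above; variable names are illustrative only.
EXPECTED_TOTAL = 58_800_000   # ~97% of the NHSD total (point 2.1)
COUNTED_TOTAL = 57_800_000    # our count as of the last run (point 4)
EMIS_REPORTED = 35_235_000    # what EMIS say they reported to NHSD (point 3)
EMIS_COUNTED = 33_700_000     # our count of EMIS patients (point 4.1)

# Shortfall = what we expected to see minus what we actually counted.
total_shortfall = EXPECTED_TOTAL - COUNTED_TOTAL
emis_shortfall = EMIS_REPORTED - EMIS_COUNTED

print(f"Total shortfall: {total_shortfall:,}")
print(f"EMIS shortfall:  {emis_shortfall:,}")
```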

(precise numbers here)

Per #2, I've been told that more than 1m patients are expected to be added in the latest build, so this issue is probably going to be resolved.

However, we should do the following:

Potentially do all of these as time-series plots.

We will need to keep historic data from the previous run to do the delta; we could perhaps handle this by just writing to a (series of?) data files that only exist on the server.
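One possible way to keep the history and compute the delta, sketched under assumptions: a single CSV file on the server with one row per run and per source. The file layout, column names and function names (`append_counts`, `deltas_since_previous_run`) are hypothetical and only meant to illustrate the idea.

```python
import csv
from pathlib import Path

# Hypothetical column layout for the history file.
FIELDS = ["run_date", "source", "patient_count"]


def append_counts(history_path, run_date, counts):
    """Append one row per source for this run, writing a header if the file is new.

    run_date is an ISO date string (YYYY-MM-DD); counts maps source name to count.
    """
    path = Path(history_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        for source, count in counts.items():
            writer.writerow(
                {"run_date": run_date, "source": source, "patient_count": count}
            )


def deltas_since_previous_run(history_path):
    """Return {source: latest_count - previous_count} for sources with two or more runs."""
    by_source = {}
    with Path(history_path).open(newline="") as f:
        for row in csv.DictReader(f):
            by_source.setdefault(row["source"], []).append(
                (row["run_date"], int(row["patient_count"]))
            )
    deltas = {}
    for source, runs in by_source.items():
        runs.sort()  # ISO date strings sort chronologically
        if len(runs) >= 2:
            deltas[source] = runs[-1][1] - runs[-2][1]
    return deltas
```

Since every run is appended rather than overwritten, the same file could also feed the time-series plots mentioned above.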

sebbacon commented 3 years ago

Notes from call. The opt-out count is in the hundreds of thousands (e.g. 900k?). We should get the data approximately tomorrow.

sebbacon commented 3 years ago

The list of organisations being used is a static list. They are updating it to include all open NHSE GP practices, plus closed practices whose closure date is in the future.

Also note that there are various business rules which exclude some kinds of patients e.g. temporary patients. We're getting a list of business rules by tomorrow from Nas, for use in the paper.

sebbacon commented 3 years ago

On the patient counts they've been investigating.

There are five reasons for differences:

  1. Type 1 opt-outs: ~900k
  2. Organisation list. The list of practices is cautious. It's 3920 practices at the moment. There are about 40-50 more we might get. All open, plus those closed with a future date.
  3. Deduplication on patients being active in two practices: will always give us the latest registration where they are in two practices.
  4. NHS number validation, e.g. they exclude test numbers
  5. Late arrival of data in EMISX
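As an illustration of reason 3, here is a minimal sketch of keeping only the latest registration when a patient is active at two practices. This is not EMIS's actual implementation; the input shape (tuples of patient id, practice id, registration date) and the function name are assumptions made for the example.

```python
from datetime import date


def latest_registrations(registrations):
    """Keep one registration per patient: the one with the most recent date.

    registrations: iterable of (patient_id, practice_id, registered_on) tuples,
    where registered_on is a datetime.date. This input shape is hypothetical.
    Returns {patient_id: (practice_id, registered_on)}.
    """
    latest = {}
    for patient_id, practice_id, registered_on in registrations:
        current = latest.get(patient_id)
        # Replace the stored registration only if this one is strictly newer.
        if current is None or registered_on > current[1]:
            latest[patient_id] = (practice_id, registered_on)
    return latest
```

With this rule, a patient who appears at two practices is counted once, so the deduplicated total can be lower than the sum of per-practice list sizes.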

What they call "rebulking the data" - 20 billion obs - very big job.

They have a job in Airflow which they trigger, and which takes a day or so.

The jobs are monitored by ops people (e.g. noticing resources running out or Presto having had a restart). On that day, Presto restarted, and a job was restarted but not gracefully.

They have a theory about the 200k patients:

This is what they do right now

image

If a practice is marked as closed some time between two time points, it will be automatically excluded (in the highlight).

So we can verify this by counting the practices.

Lots are marked for closed with a future date (i.e. merging).
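To verify by counting practices, something like the following sketch could be used. It implements the rule described earlier for the updated list (all open practices, plus those closed with a future date), not the current behaviour shown in the screenshot. The data shape (`practices` mapping a practice id to a closure date, or `None` if open) and the function names are hypothetical.

```python
from datetime import date


def is_included(close_date, as_of):
    """A practice counts if it is open (no closure date) or closes after as_of."""
    return close_date is None or close_date > as_of


def count_included(practices, as_of):
    """Count practices that pass the inclusion rule on the given date.

    practices: {practice_id: close_date or None}; this shape is an assumption.
    """
    return sum(
        1 for close_date in practices.values() if is_included(close_date, as_of)
    )
```

Comparing this count against the size of the static list on the same date would show how many practices the static list leaves out.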

Dima has 4 screens

Here is the bit with the static organisations: image

Confused by the patient thing. They have a list of fixed ou