owid / covid-19-data

Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data
https://ourworldindata.org/coronavirus

Testing data for South Korea #2219

Closed WWolf closed 2 years ago

WWolf commented 2 years ago

JFYI:

Following a webpage update, ROK now publishes daily testing data instead of cumulative data. As OWID is probably only crawling this metric weekly, the positivity rate and test data in OWID are not being updated.

Official source: http://ncov.mohw.go.kr/en/bdBoardList.do?brdId=16&brdGubun=161&dataGubun=&ncvContSeq=&contSeq=&board_id=

Official test positivity metric is calculated as (covid case of the particular day) / (test numbers of the preceding day).
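
A minimal sketch of that lagged calculation, for illustration only. The function name and the sample numbers are made up, not official figures; it assumes daily cases and daily tests are keyed by date.

```python
from datetime import date, timedelta


def lagged_positivity(cases_by_day, tests_by_day):
    """Positivity per day: cases on day D divided by tests on day D-1."""
    rates = {}
    for day, cases in cases_by_day.items():
        tests_prev = tests_by_day.get(day - timedelta(days=1))
        if tests_prev:  # skip days with no (or zero) tests on the preceding day
            rates[day] = cases / tests_prev
    return rates


# Illustrative numbers only.
cases = {date(2022, 1, 5): 40, date(2022, 1, 6): 50}
tests = {date(2022, 1, 4): 1000, date(2022, 1, 5): 2000}
print(lagged_positivity(cases, tests))
```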

I do not know whether this goes against the principles of OWID, but these daily cumulative numbers are still recorded at Asia Regional Information Center at Seoul National University (TOTAL_TEST) on the first google sheet:

https://sites.google.com/view/snuaric/data-service/covid-19/covid-19-data

lucasrodes commented 2 years ago

@camappel Did #2217 fix this issue?

WWolf commented 2 years ago

Although I do not see the update yet, does this https://github.com/owid/covid-19-data/pull/2217 effectively crawl daily data (also considering time zone differences)?

Also it seems it does not backfill the missing data points and I wonder whether manual pull request to backfill the missing data would be allowed (I can contribute to that).

P.S. I also wonder how the update would affect the test positivity metric as the official stats has a day difference (covid case reported from the day divided by the test numbers of the preceding day)

lucasrodes commented 2 years ago

> Although I do not see the update yet, does this #2217 effectively crawl daily data (also considering time zone differences)?

The update of the global files is scheduled to run within the next few hours, 24 hours at most. For now, I have updated the individual file for South Korea with the last available data point, from 2022-01-05.

This part of the script parses the reported date from the source (http://ncov.mohw.go.kr/en/bdBoardList.do?brdId=16&brdGubun=161&dataGubun=&ncvContSeq=&contSeq=&board_id=)

https://github.com/owid/covid-19-data/blob/0f7573f52568f6023f8da7df6df2da456ef308bc/scripts/scripts/testing/automations/incremental/south_korea.py#L30-L34

> Also it seems it does not backfill the missing data points and I wonder whether manual pull request to backfill the missing data would be allowed (I can contribute to that).

The update is incremental, as the source only reports the daily figure (not historical data). A pull request to backfill the missing data would be very much appreciated; it should edit this file

> P.S. I also wonder how the update would affect the test positivity metric as the official stats has a day difference (covid case reported from the day divided by the test numbers of the preceding day)

The positive rate is computed later in the process by dividing the daily new cases by the daily number of tests, all smoothed over a window of 7 days:

https://github.com/owid/covid-19-data/blob/23dcaaad6c63586565e531e15a25f7ed34e528e2/scripts/scripts/testing/generate_dataset.R#L174-L179
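
For readers who do not want to open the R script, here is a rough sketch of the idea in plain Python. It assumes the rate is the ratio of two trailing 7-day means; the linked script may differ in details, and the function names are hypothetical.

```python
def rolling_mean(values, window=7):
    """Trailing mean; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def smoothed_positive_rate(new_cases, new_tests, window=7):
    """Ratio of smoothed daily cases to smoothed daily tests."""
    cases_avg = rolling_mean(new_cases, window)
    tests_avg = rolling_mean(new_tests, window)
    return [
        c / t if c is not None and t else None  # None when undefined
        for c, t in zip(cases_avg, tests_avg)
    ]


# Illustrative numbers only: eight days of cases and tests.
print(smoothed_positive_rate([10] * 7 + [24], [100] * 8))
```

Because both series are averaged over the same window, a one-day shift between numerator and denominator changes only one of the seven days in each window.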


Hope this helps

WWolf commented 2 years ago

Thank you very much for the detailed response @lucasrodes !

As for the backfill, I will take some time to submit a pull request this weekend (also related: https://github.com/owid/covid-19-data/issues/2225 new hospital admissions).

As for the positivity rate, South Korea's official calculation is again shifted by one day between numerator and denominator, although a 7-day smoothed average may not introduce too much variation.

P.S. BTW, I found that the official stats are labeled "tests performed" in contrast to "people tested". There are two separate stats from the Korean government, covering both "tests performed" and "people tested". For the backfill, would it be better to choose one over the other, or even record both? I also think that the value being crawled is "people tested", not "tests performed". I can double-check this later and include it in the pull request as well.

WWolf commented 2 years ago

I was working on the backfill and found out that the testing stat in OWID is actually only one type of testing stat. KDCA publishes two types of testing stats: (a) "Number of suspicious report testing" and (b) "Number of testing at temporary screening stations".

The old OWID testing count is essentially the cumulative count for the first one and excludes the second one. The two test counts were comparable during the last winter (2020-2021), but for this winter, (b) is about 2-3 times larger than (a). KDCA also publishes separate positivity rates for (a) and (b), and the positivity rate is lower for (b).

At the moment, what is being crawled is (a)+(b). How should I backfill the data? The earliest data that I can access (Korean) for (b) is from 2020-12-18, although the (a) data dates back to 2020-01-20.

camappel commented 2 years ago

Hi @WWolf ,

Thanks a lot for looking into this.

The new and backfilled testing figures should include (a) + (b). Data prior to 2020-12-18 can include only (a); we can remove positive rate estimates for the week following that date to account for a spike in the figures.
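
A small sketch of that rule, under stated assumptions: the helper names are hypothetical, and "the week following" is read here as the 7 days starting on 2020-12-18 itself.

```python
from datetime import date, timedelta

SERIES_BREAK = date(2020, 12, 18)  # first day with (b) data, per this thread


def people_tested(day, a, b):
    """(a) only before the break, (a)+(b) from the break onward."""
    if day < SERIES_BREAK or b is None:
        return a
    return a + b


def mask_positive_rate(day, rate, days=7):
    """Drop positive-rate estimates for the week following the break."""
    if SERIES_BREAK <= day < SERIES_BREAK + timedelta(days=days):
        return None
    return rate


print(people_tested(date(2020, 12, 18), 100, 50))
print(mask_positive_rate(date(2020, 12, 20), 0.01))
```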

As for the unit labels, the backfilled figures should be in the same unit as the value that is being crawled. If the value that's being crawled is indeed people tested, not tests performed, we will change the label, and the backfilled figures should refer to people tested. On the English version of the dashboard, testing figures refer to 'Number of testing', which is unclear; perhaps the Korean version is clearer?

Thank you for pointing out the lag in the positive rate calculation; we can solve that issue quickly afterwards.

Let me know if you have any more questions!

WWolf commented 2 years ago

I have checked this: (a) is indeed people tested, but the Korean wording for (b) reads more like tests performed.

But given that:

So I believe (a)+(b) could be labeled as people tested, and once that is confirmed I will go ahead with a pull request!

WWolf commented 2 years ago

I found a Korean publication confirming that (b) is also people tested: https://www.kdca.go.kr/board/board.es?mid=a20501010000&bid=0015&list_no=717855&cg_code=&act=view&nPage=12

image

So I will go ahead with the pull request (I found some small glitches in the series; I will fix them and submit soon).

WWolf commented 2 years ago

I have opened pull request https://github.com/owid/covid-19-data/pull/2245 for the backfill.

The following graph compares the original testing stats in OWID (black) with the testing stats sourced from SNU ARIC (only (a)): image

The following is (a)+(b) (after 2020-12-18): image

There are small discrepancies in the numbers. I double-checked a few individual official reports and found that official sources do sometimes retrospectively change the numbers by a small amount (less than 10, usually 1-2).

camappel commented 2 years ago

Hi @WWolf ,

I just merged your pull request.

Thank you very much for your diligent work. I agree, it looks like 'people tested' is the appropriate label, as we prefer to align our data with the official figures. Great job identifying that the old figures did not include (b) also.

I will work on implementing the one day lag to match the official figures soon.

All the best!

lucasrodes commented 2 years ago

Hi @WWolf, I was checking South Korean data and realized that the data pushed in https://github.com/owid/covid-19-data/pull/2245 is quite different compared to the data reported by Asia Regional Information Center at Seoul National University (TOTAL_TEST).

My understanding is that this is due to what you mentioned

> At the moment, what is being crawled is (a)+(b). How should I backfill the data? The earliest data that I can access (Korean) for (b) is from 2020-12-18, although the (a) data dates back to 2020-01-20.

My understanding is that you have obtained the values for (b) manually, from daily updates, computed the cumulative sum and added to the other tests registered in SNU ARIC's spreadsheet. Is that right?

Thanks

camappel commented 2 years ago

From what I understand:

  1. The figures in Asia Regional Information Center at Seoul National University (TOTAL_TEST) only include 'Number of suspicious report testing', or (a)
  2. The figures in Asia Regional Information Center at Seoul National University (Testing) have both 'Number of suspicious report testing' and 'Number of testing at temporary screening stations', or (a)+(b) since 2020-12-18
  3. We currently scrape the total of 'Number of suspicious report testing' and 'Number of testing at temporary screening stations', or (a)+(b), from KCDC's dashboard

#2245 by @WWolf used (2.) to get (a)+(b) back to 2020-12-18, then (1.) to get just (a) back to 2020-01-20

WWolf commented 2 years ago

Hi @lucasrodes ,

@camappel is correct.

SNU ARIC records (a) both on the main "Cases in Korea_Original" sheet and on the "Testing" sheet, where it appears as "의심신고 검사자 수". (Note that the "Testing" sheet also includes "Total", which is a tests performed metric, not a people tested metric.) Typically you can find it in the Korea DCA daily reports (Sample; you have to download the PDF file and see page 5, screenshot below):

image

I have checked SNU ARIC's numbers. Aside from a few glitches (there is a big one on 2022-01-07), which I checked manually against the daily reports (see the sample above) and fixed, the numbers from the two sheets are identical.

The "Testing" sheet also contains "임시선별검사소 검사건수", which corresponds to "Number of testing at temporary screening stations", the (b) that I mentioned. KDCA (formerly KCDC, now renamed the Korea Disease Control and Prevention Agency) started to publish this separate metric in its daily reports as these numbers became somewhat dominant, starting from the last (2020-2021) winter wave. They also started to publish a separate count of positive cases from these temporary screening stations, "임시선별검사소 확진자 수" (a subset of the reported total positive cases), so one can even derive a positivity rate for these stations. That rate tends to be lower, because anyone can walk in and get tested at these stations, whereas (a) covers tests generally performed at medical institutions on suspected cases or as part of tracing efforts. In any case, I think SNU ARIC kept the column name "TOTAL_TEST" for consistency, although it clearly covers only the (a) type of testing and does not represent the bulk of testing conducted in Korea (see the graph below).

Note also that the SNU ARIC "Testing" sheet contains two more columns, "수도권 임시선별검사소 검사건수" (Number of testing at temporary screening stations, from the Seoul metropolitan area) and "비수도권 임시선별검사소" (Number of testing at temporary screening stations, outside the Seoul metropolitan area). Essentially this was a further breakdown of (b), with corresponding case counts from these stations. Since November, the two have been summed back into (b).
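
A short sketch of how one might reassemble (b) for the period when it was split. The column names are the ones quoted above; the function name and the dict-per-row layout are assumptions for illustration.

```python
def temporary_station_tests(row):
    """Return (b) for one row of the "Testing" sheet.

    Prefer the combined column; fall back to metro + non-metro
    for the period when only the split columns were filled in.
    """
    combined = row.get("임시선별검사소 검사건수")
    if combined is not None:
        return combined
    metro = row.get("수도권 임시선별검사소 검사건수")
    other = row.get("비수도권 임시선별검사소")
    if metro is None or other is None:
        return None  # not enough information for this day
    return metro + other


# Illustrative row from the split period (numbers are made up).
print(temporary_station_tests({
    "수도권 임시선별검사소 검사건수": 70000,
    "비수도권 임시선별검사소": 30000,
}))
```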

Just for your information, this is the crude test positivity of (a)+(b) (blue), (a) (red), and (b) (green) that I made while preparing for the backfill (7-day rolling mean): image

Number of tests (7-day rolling mean): image

Hope this helps for the clarification!

camappel commented 2 years ago

감사합니다 (thank you)!

lucasrodes commented 2 years ago

Hi @WWolf, @camappel Thank you so much for the explanation. This helps me a lot.

We are now considering whether we could automate this import, including historical data, which would make it easier to maintain in the future.

We currently have two metrics:

  1. Cumulative total: Total number of people tested so far.
  2. Daily change in cumulative total: Daily number of people tested.

From now on we can proceed as follows to estimate both metrics:

Metric Cumulative total

Compute people tested = TOTAL_TEST + cumulative(임시선별검사소 검사건수)

Metric Daily change in cumulative total

Next, for metric 2 we can simply estimate it by taking the daily difference from metric 1 (the pipeline already does this).
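
Sketched in Python, the two metrics could look roughly like this. This is only an illustration of the arithmetic, not the pipeline's actual code: the function names are hypothetical, and missing (b) values (before 2020-12-18) are assumed to count as zero.

```python
from itertools import accumulate


def cumulative_people_tested(total_test, temp_station_daily):
    """Metric 1: TOTAL_TEST (already cumulative) + running sum of daily (b)."""
    running_b = accumulate(0 if b is None else b for b in temp_station_daily)
    return [a + b for a, b in zip(total_test, running_b)]


def daily_change(cumulative):
    """Metric 2: day-over-day difference; the first day has no previous value."""
    if not cumulative:
        return []
    return [None] + [cur - prev for prev, cur in zip(cumulative, cumulative[1:])]


# Illustrative numbers only.
cumulative = cumulative_people_tested([100, 150, 210], [None, 20, 30])
print(cumulative)
print(daily_change(cumulative))
```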


Finally, you mention that you have added some manual fixes. Ideally, we should encode these in our script. We can see values between 2022-01-02 and 2022-01-07 might have been manually added.

WWolf commented 2 years ago

Hi @lucasrodes,

As for the manual fixes, in my pull request https://github.com/owid/covid-19-data/pull/2245/commits/3d630821be1bed6759f7a5765ccdaeec25008dd4 the Source URLs point to the actual daily reports (the attached PDFs) that I parsed to make the fixes. 2021-12-07 is also fixed (this appears to be a human error on SNU ARIC's side). I checked these by hand mainly because the daily difference of the main sheet column (TOTAL_TEST) did not match the value in the Testing sheet (의심신고 검사자 수).

Just my two cents on the sourcing: I am a bit hesitant about sourcing cumulative numbers, in particular using TOTAL_TEST from SNU as the representative metric for (a). The reasons for my hesitance are described below (not an objection, just for your further information):

(1) I see evidence that the daily values get revised, typically by a tiny amount, but there are glaring revisions, such as for 2022-01-02 to 2022-01-07. This happened because of the year change (2021 -> 2022): KDCA misidentified repeated testing that spanned 2021 and 2022 as tests on different individuals, although it was repeated testing of the same individuals. They put this correction as a footnote in the 2022-01-08 daily report, c.f. PDF page 5.

(2) Accordingly, they revised the weekly cumulative numbers, but I strongly suspect that they did not revise the total cumulative numbers (which TOTAL_TEST tracks).

(3) As a result, for example, the 2022-01-07 difference is not fixed in TOTAL_TEST but is somewhat fixed in the Testing sheet of SNU ARIC, which tracks KDCA data. As evidence, as of today, the KDCA daily report (containing the most up-to-date 2022-01-07 test metric) gives the test value for 2022-01-07 as 72,725, not 96,665.

(4) SNU ARIC has it as 72,728, a difference of 3 tests, which again is a revised value from the 2022-01-08 report. So the daily numbers get corrected over the course of a week in official sources.

(5) I think SNU ARIC caught this revision on 2022-01-08 because the Testing sheet update typically lags by one day: the tests performed metric is reported with an additional one-day delay, and the report contains the previous 7 days of revised testing / case metrics.

(6) In sum, the TOTAL_TEST cumulative value has, in my opinion, some errors that accumulate over a long period of time, and differencing it may not be a good way to represent daily testing volume; SNU ARIC also only partially makes retrospective revisions (typically one day).

(7) My suggestion would be, if you choose to track this site, to source both (a) and (b) from the same sheet, "Testing", which would be (partially) more stable; the downside is that it lags one day behind.
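
To make the check concrete, here is a minimal sketch of how such mismatches could be flagged automatically. The function name is hypothetical. The 96,665 and 72,728 figures echo the ones quoted above; the cumulative levels and the other daily values are made up for the example.

```python
def flag_mismatches(dates, cumulative, reported_daily, tolerance=0):
    """Compare the daily difference implied by a cumulative series
    with the separately reported daily value, and list disagreements."""
    flagged = []
    for i in range(1, len(dates)):
        implied = cumulative[i] - cumulative[i - 1]
        reported = reported_daily[i]
        if reported is not None and abs(implied - reported) > tolerance:
            flagged.append((dates[i], implied, reported))
    return flagged


dates = ["2022-01-06", "2022-01-07", "2022-01-08"]
total_test = [1_000_000, 1_096_665, 1_170_000]  # made-up cumulative levels
testing_sheet = [96_226, 72_728, 73_335]        # daily values, partly made up
print(flag_mismatches(dates, total_test, testing_sheet))
```

A non-zero `tolerance` could be used to ignore the small 1-2 test revisions and surface only the large ones.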

Hope this helps!

lucasrodes commented 2 years ago

Thanks, @WWolf, I understand! Thanks for the response.

If we want cumulative numbers, we still may need the values from TOTAL_TEST before "2020-12-18", and then add values from "Testing" cumulatively after the aforementioned date.

cc. @camappel

WWolf commented 2 years ago

P.S. @lucasrodes and @camappel, note that last year, for a while, 임시선별검사소 검사건수 was split into 수도권 임시선별검사소 검사건수 and 비수도권 임시선별검사소. Below is a code snippet that might be helpful (if you are going to use R) to avoid reinventing the wheel and to translate the Korean column names smoothly into English. It is a slight modification of the export backfill code I made (with translations). It deals with the split of the temporary station data last year, and removes the most recent point (because it is incomplete):

library(tidyverse)

readxl::read_excel(
  "corona/SNU ARIC Dataset for Korea.xlsx",
  sheet = "Cases in Korea_Original"
) %>%
  mutate(
    Date = as.Date(DATE),
    TOTAL_TEST = as.integer(TOTAL_TEST),
    CONFIRM = as.integer(CONFIRM)
  ) %>%
  arrange( Date ) %>%
  mutate(
    test.suspicious2 = TOTAL_TEST - lag(TOTAL_TEST),
    case.total2 = CONFIRM - lag(CONFIRM)
  ) %>%
  select( Date, test.suspicious2, case.total2 ) %>%
  left_join(
    readxl::read_excel(
      "corona/SNU ARIC Dataset for Korea.xlsx",
      sheet = "Testing"
    ) %>%
    mutate(
      Date = as.Date(Date)    
    ),
    by = "Date"
  ) %>%
  mutate(
    test.performed = Total,
    test.suspicious = ifelse(
      is.na(`의심신고 검사자 수`),
      test.suspicious2,
      `의심신고 검사자 수`
    ),
    test.temporary = `임시선별검사소 검사건수`,
    test.temporary.smetro = `수도권 임시선별검사소 검사건수`,
    test.temporary.other = `비수도권 임시선별검사소`,
    test.temporary = ifelse(
      is.na( test.temporary ),
      test.temporary.smetro + test.temporary.other,
      test.temporary
    ),
    case.temporary = `임시선별검사소 확진자 수`,
    case.temporary.smetro = `수도권 임시선별검사소 확진자 수(명)`,
    case.temporary.other = `비수도권임시선별검사소 확진자 수(명)`,
    case.temporary = ifelse( 
      is.na( case.temporary), 
      case.temporary.smetro + case.temporary.other,
      case.temporary
    ),
    case.total = `신규확진자수`
  ) %>%
  # manual fixing of dates where test.suspicious does NOT match test.suspicious2
  mutate(
    test.performed = case_when(
      Date == as.Date("2021-12-07") ~ 620656,
      Date == as.Date("2022-01-02") ~ 304362,
      Date == as.Date("2022-01-03") ~ 283868,
      Date == as.Date("2022-01-04") ~ 472079,
      Date == as.Date("2022-01-05") ~ 449192,
      Date == as.Date("2022-01-06") ~ 412383,
      Date == as.Date("2022-01-07") ~ 414436,
      TRUE ~ test.performed
    ),
    test.suspicious = case_when(
      Date == as.Date("2021-12-07") ~ 81295,
      Date == as.Date("2022-01-02") ~ 92779,
      Date == as.Date("2022-01-03") ~ 89984,
      Date == as.Date("2022-01-04") ~ 124999,
      Date == as.Date("2022-01-05") ~ 106944,
      Date == as.Date("2022-01-06") ~ 96226,
      Date == as.Date("2022-01-07") ~ 72728,
      TRUE ~ test.suspicious
    ),
    test.temporary = case_when(
      Date == as.Date("2021-12-07") ~ 205093,
      Date == as.Date("2022-01-02") ~ 100787,
      Date == as.Date("2022-01-03") ~ 98570,
      Date == as.Date("2022-01-04") ~ 156330,
      Date == as.Date("2022-01-05") ~ 129910,
      Date == as.Date("2022-01-06") ~ 126577,
      Date == as.Date("2022-01-07") ~ 129675,
      TRUE ~ test.temporary
    ),
    case.total = case_when(
      Date == as.Date("2021-12-07") ~ 4954,
      Date == as.Date("2022-01-02") ~ 3831,
      Date == as.Date("2022-01-03") ~ 3125,
      Date == as.Date("2022-01-04") ~ 3023,
      Date == as.Date("2022-01-05") ~ 4443,
      Date == as.Date("2022-01-06") ~ 4125,
      Date == as.Date("2022-01-07") ~ 3716,
      TRUE ~ case.total
    ),
    case.temporary = case_when(
      Date == as.Date("2021-12-07") ~ 1325,
      Date == as.Date("2022-01-02") ~ 1268,
      Date == as.Date("2022-01-03") ~ 997,
      Date == as.Date("2022-01-04") ~ 828,
      Date == as.Date("2022-01-05") ~ 1595,
      Date == as.Date("2022-01-06") ~ 1373,
      Date == as.Date("2022-01-07") ~ 1151,
      TRUE ~ case.temporary
    )
  ) %>%
  # postprocessing
  mutate(
    # Aggregate metrics
    test.total = ifelse(
      is.na( test.temporary ),
      test.suspicious2,
      test.suspicious + test.temporary
    ),
    case.total = ifelse(
      is.na(case.total),
      case.total2,
      case.total
    ),
    case.suspicious = ifelse(
      is.na( case.temporary ),
      case.total2,
      case.total - case.temporary 
    )
  ) %>%
  select( Date, starts_with("test"), starts_with("case") ) %>%
  filter( Date != max(Date) ) %>% # ensure that the most recent record is removed
  # For export
  mutate(
    `Source URL` = "https://sites.google.com/view/snuaric/data-service/covid-19/covid-19-data?authuser=0",
    `Cumulative total` = cumsum( ifelse( is.na(test.total), 0, test.total )  ),
    `Country` = "South Korea",
    `Source label` = "SNU ARIC (from Korea DCA)",
    `Units` = "people tested",
    `Notes` = ifelse(
      Date >= as.Date("2020-12-18"),
      "suspicious report testing + testing at temporary screening stations",
      "only number of suspicious report testing"
    ),
    `Source URL` = case_when(
      Date == as.Date("2021-12-07") ~ "https://www.kdca.go.kr/board/board.es?mid=a20501010000&bid=0015&list_no=717863&cg_code=&act=view&nPage=11",
      Date >= as.Date("2022-01-02") & Date <= as.Date("2022-01-07") ~ "https://www.kdca.go.kr/board/board.es?mid=a20501010000&bid=0015&list_no=718241&cg_code=&act=view&nPage=1",
      TRUE ~ `Source URL`
    ),
    `Source label` = case_when(
      Date == as.Date("2021-12-07") ~ "Korea DCA",
      Date >= as.Date("2022-01-02") & Date <= as.Date("2022-01-07") ~ "Korea DCA",
      TRUE ~ `Source label`
    ),
    `Daily change in cumulative total` = test.total
  ) %>%
  select( Date, `Source URL`, `Cumulative total`, `Country`, `Source label`, `Units`, `Notes`, `Daily change in cumulative total` )

WWolf commented 2 years ago

@camappel Hi, it looks like after 2022-01-10, Korea test data is not being updated properly. Probably related to the refactor? https://github.com/owid/covid-19-data/pull/2262

P.S. I think I just had a refresh glitch. Everything looks OK.

WWolf commented 2 years ago

Korea DCA changed its usual daily testing statistics as of 2022-02-07; the new statistics start on 2022-02-08. Due to the Omicron wave, they have revamped the whole Test & Trace & Isolate strategy. They stopped collecting 의심신고 검사자 수 and 임시선별검사소 검사건수 separately and now aggregate them into 선별진료소(통합), which could be translated as something like "Number of testing at screening stations (Aggregate)". Based on the numbers, it looks like it is still people tested, not tests performed.

Below is a screenshot from 2022-02-09 daily report:

image

The SNU ARIC also made the change by adding an additional column, 선별진료소(통합). Due to the transition, there is no corresponding test data on 2022-02-07.
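
One possible way to read a row across this transition, as a rough sketch. It assumes the new column can be treated as a continuation of (a)+(b), which is only a guess based on the numbers; the function name and the dict-per-row layout are hypothetical.

```python
def people_tested_for_row(row):
    """Pick the daily people-tested value from one "Testing" sheet row.

    Assumption: the aggregate column replaces (a)+(b) once it is filled in.
    """
    aggregate = row.get("선별진료소(통합)")
    if aggregate is not None:
        return aggregate
    a = row.get("의심신고 검사자 수")
    b = row.get("임시선별검사소 검사건수")
    if a is None:
        return None  # e.g. 2022-02-07, which has no test data
    return a + (b or 0)


# Illustrative rows (numbers are made up).
print(people_tested_for_row({"의심신고 검사자 수": 100, "임시선별검사소 검사건수": 200}))
print(people_tested_for_row({"선별진료소(통합)": 350}))
print(people_tested_for_row({}))
```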

camappel commented 2 years ago

감사합니다 (thank you) @WWolf,

I have addressed this issue in #2424 .

WWolf commented 2 years ago

Dear OWID,

It looks like SNU ARIC has stopped aggregating testing stats, so the Korean stats are stale as of 2022-06-15 (the reasoning is that the Korean government officially downgraded the disease for "living with covid"). The official government report on testing data is still published daily (in Korean), with some revisions for up to a week.

I have added a testing data CSV file to my GitHub repository. It copies the SNU ARIC data (with the last few days adjusted for revisions) and adds the most up-to-date stats from KDCA after the termination. As I update the hospitalization data semi-weekly, I will update this as well for your reference (because I am updating roughly weekly, some days will have minor discrepancies due to later revisions). Hope this helps. @lucasrodes @camappel

https://github.com/WWolf/korea-covid19-hosp-data/blob/main/testing.csv