jihoo-kim / Data-Science-for-COVID-19

DS4C: Data Science for COVID-19 in South Korea

Does the data contain all COVID patients? #8

Open rosschu opened 4 years ago

rosschu commented 4 years ago

First of all, thanks for the great data resource. In PatientInfo.csv, I'm noticing that it only has around 2800 entries, while I'm aware that Korea has over 9000 patients at this point. Will the additional 6000 entries come in some time soon?

jihoo-kim commented 4 years ago

About 6,000 of the patients were confirmed in Daegu, and Daegu Metropolitan City Hall is not releasing their information because of the rapid increase in the number of confirmed cases. This means we are not sure whether we will be able to add the data for those 6,000 patients.

On Thu, Apr 2, 2020 at 11:41 PM, Ross Chu notifications@github.com wrote:

First of all, thanks for the great data resource. In PatientInfo.csv, I'm noticing that it only has around 2800 entries, while I'm aware that Korea has over 9000 patients at this point. Will the additional 6000 entries come in some time soon?


rosschu commented 4 years ago

I see, thanks for that info.

Here's another idea I had for adding to PatientRoute.csv if this interests you:

Idea: Can we augment the PatientRoute.csv file with data on visitor traffic to these locations? For instance, patient #5 in Gwangjin-gu (광진구) visited a store on Jan 26, and we'd like to know how many people visit that store on a typical day. We could get this from location data held by map services such as T-Map, Kakao Map, or Naver Map.

Why this is useful: Tracking the number of people visiting each location allows us to estimate the likelihood of disease transmission at each location, as follows:

Likelihood of transmission = (Likelihood of exposure) × (Likelihood of infection, conditional on exposure)

Calculating this likelihood would be extremely useful both for modeling the spread of COVID-19 in a region and for resource-constrained governments that want to prioritize testing for the individuals most likely to transmit the disease to others.
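To make the formula above concrete, here is a minimal Python sketch. The function name, the location names, and every probability are made up for illustration; nothing here comes from the dataset, and how the exposure likelihood would actually be derived from visitor counts is left open.

```python
def transmission_likelihood(p_exposure: float, p_infection_given_exposure: float) -> float:
    """Likelihood of transmission = P(exposure) * P(infection | exposure)."""
    for p in (p_exposure, p_infection_given_exposure):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_exposure * p_infection_given_exposure


# Hypothetical locations: (P(exposure), P(infection | exposure)).
# These numbers are invented, not real estimates.
locations = {
    "store_A": (0.20, 0.05),
    "cafe_B": (0.50, 0.10),
    "gym_C": (0.10, 0.30),
}

# Rank locations from highest to lowest estimated transmission likelihood.
ranked = sorted(
    locations,
    key=lambda name: transmission_likelihood(*locations[name]),
    reverse=True,
)
print(ranked)
```

A ranking like this is one way the result could feed into prioritizing tests, assuming both probabilities can be estimated per location.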

Data Source: I noticed that this project is sponsored by SKT, so maybe we could reach out to them for visitor-traffic data from their T-Map app? We could anonymize the data by aggregating the number of visits at the location-date level, which would address consumer privacy concerns.
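The location-date aggregation could look roughly like the sketch below. The record layout `(user_id, location, date)` is an assumption about what raw visit data might contain, not a description of any actual T-Map export, and the records are invented.

```python
from collections import Counter

# Hypothetical raw visit records: (user_id, location, date). Made-up data.
raw_visits = [
    ("u1", "store_A", "2020-01-26"),
    ("u2", "store_A", "2020-01-26"),
    ("u1", "cafe_B", "2020-01-26"),
    ("u3", "store_A", "2020-01-27"),
]


def aggregate_visits(records):
    """Count visits per (location, date), discarding user identifiers."""
    return Counter((location, date) for _user, location, date in records)


daily_counts = aggregate_visits(raw_visits)
for (location, date), n in sorted(daily_counts.items()):
    print(location, date, n)
```

Only the counts leave the function; the user identifiers are dropped during aggregation.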

P.S. I'm currently based in Korea, so PM me separately if you'd like to discuss anything over the phone!

rosschu commented 4 years ago

To elaborate, here are some examples I'm thinking of:

Kakao mobility report (describes the type of data we will need) https://brunch.co.kr/@kakao-it/36

T-Map has published the information we need (foot-traffic counts, 유동인구수), but the geographic unit is too coarse: it is at the city/county/district (시군구) level, whereas we would need something at the building or location level. https://www.bigdatahub.co.kr/product/view.do?pid=1002286

T-Map also publishes popular searches at the location level (searched place names, 검색지명). https://www.bigdatahub.co.kr/product/view.do?pid=1002290

Ideally, what we would want is the combination of the two datasets above: foot traffic (유동인구) at each searched place name (검색지명), recorded at a daily frequency.
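If such a daily, place-level dataset existed, attaching it to the patient routes could be sketched as follows. The column names (`patient_id`, `date`, `place`), the traffic table, and all values are hypothetical stand-ins; the real PatientRoute.csv schema and any T-Map data format may differ.

```python
import csv
import io

# Hypothetical extract in the spirit of PatientRoute.csv; columns and values are illustrative.
route_csv = """patient_id,date,place
5,2020-01-26,store_A
5,2020-01-27,cafe_B
"""

# Hypothetical daily foot traffic keyed by (place, date); made-up numbers.
daily_traffic = {
    ("store_A", "2020-01-26"): 340,
    ("cafe_B", "2020-01-26"): 120,
}


def add_visitor_counts(route_file, traffic):
    """Attach a 'visitors' value to each route row; None when no traffic data exists."""
    rows = []
    for row in csv.DictReader(route_file):
        row["visitors"] = traffic.get((row["place"], row["date"]))
        rows.append(row)
    return rows


augmented = add_visitor_counts(io.StringIO(route_csv), daily_traffic)
for row in augmented:
    print(row)
```

Keying on the (place, date) pair is what makes the daily frequency matter: a route row only gets a count if traffic was recorded for that place on that same day.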