reconhub / covid19hub

Community-driven COVID-19 analytics in R

Functions to import reported daily sub-national new cases and deaths from LMIC #5

Open ffinger opened 4 years ago

ffinger commented 4 years ago

Description

A function for each country that accesses public data sources and makes the data accessible in R. Good examples in this package: https://github.com/epiforecasts/NCoVUtils for a number of countries.

Functions already exist for most European and Asian countries and the US. We are currently looking for data and functions for LMICs, especially African countries.

I suggest we use this issue to keep track of

1) Needs for data and R functions for specific countries
2) Available data sources for specific countries
3) Functions that import the data sources identified in 2.

See below for google sheet tracking those.

Output

The output format of each function should be a long data frame containing the following columns:

where cases and deaths stand for the newly reported cases and deaths on that day.
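
Some sources only publish cumulative totals, in which case the daily counts have to be derived first. A minimal sketch in base R, assuming hypothetical column names (`date`, `region`, `total_cases`, `total_deaths`) and made-up numbers; it is not meant to define the column set for this issue:

```r
# Illustrative data only: cumulative totals per region and day (made-up numbers)
cumulative <- data.frame(
  date = as.Date(rep(c("2020-04-01", "2020-04-02", "2020-04-03"), times = 2)),
  region = rep(c("A", "B"), each = 3),
  total_cases = c(10, 15, 21, 0, 2, 2),
  total_deaths = c(0, 1, 1, 0, 0, 1)
)

# Difference of a cumulative series; the first day keeps its total
new_from_total <- function(x) c(x[1], diff(x))

# Sort so differences are taken in date order within each region
cumulative <- cumulative[order(cumulative$region, cumulative$date), ]
daily <- data.frame(
  date = cumulative$date,
  region = cumulative$region,
  cases = ave(cumulative$total_cases, cumulative$region, FUN = new_from_total),
  deaths = ave(cumulative$total_deaths, cumulative$region, FUN = new_from_total)
)
daily
```

Negative values in `cases` or `deaths` would point to a downward correction in the source and are worth checking by hand.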

Functions can either be added to https://github.com/epiforecasts/NCoVUtils via pull request, or we can start our own package that wraps NCoVUtils and other solutions for the already implemented countries.

Countries already covered:

https://github.com/epiforecasts/NCoVUtils covers the following countries so far:

Countries to be done

Spreadsheet to track requested countries, data sources and implementations: https://docs.google.com/spreadsheets/d/1uvg07BAmwKqLqhKvkejhkX7uvXiGCre4sz11Au3pz9Q/edit?usp=sharing

my own priority list:

Links

A few places where data sources are indexed:

https://data.humdata.org/event/covid-19
https://coronavirustechhandbook.com/home
https://www.europeandataportal.eu/data/datasets?locale=en&categories=heal&page=1&query=covid

ffinger commented 4 years ago

@patrickbarks @paulc91 @ntncmch @scottyaz @sbfnk @epiforcasts @seabbs @hamishgibbs @jhellewell14

scottyaz commented 4 years ago

Would be good to start a google spreadsheet (if it doesn't exist) with sources for each country. For example http://covid19.health.gov.mw is a good source for Malawi.

seabbs commented 4 years ago

We are keen to have contributions to our package, but we are also happy for this to be a separate project if that makes sense. We've been discussing whether and how to support it as a more widely known data resource, and that is making more and more sense.

ffinger commented 4 years ago

Google spreadsheet to track requests for countries, data sources and implementations: https://docs.google.com/spreadsheets/d/1uvg07BAmwKqLqhKvkejhkX7uvXiGCre4sz11Au3pz9Q/edit?usp=sharing

ffinger commented 4 years ago

@seabbs, happy to contribute to NCoVUtils

xt-21 commented 4 years ago

Would like to work on Burkina Faso

ColinFay commented 4 years ago

Hey,

Can you explain the process for contributing?

Do you want us to PR {NCovUtils}?

Also, there is: https://www.worldometers.info/coronavirus/

Here's a function to get today's and yesterday's data frames:

get_worldmeter_df <- function(){
  url <- xml2::read_html(
    "https://www.worldometers.info/coronavirus/"
  )
  tbls <- rvest::html_table(url)
  tbls[[1]] <- tbls[[1]][8:nrow(tbls[[1]]),]
  tbls[[2]] <- tbls[[2]][8:nrow(tbls[[2]]),]
  list(
    today = tbls[[1]], 
    yesterday = tbls[[2]]
  )
}
get_worldmeter_df()

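It has Burkina Faso, Iraq, Congo and Syria:

tod <- get_worldmeter_df()$today
# The table spells the country "Iraq", so "Irak" matches no row;
# that is why the output below shows only three countries
tod[
  tod$`Country,Other` %in% c("Burkina Faso", "Irak", "Congo", "Syria"),
]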
    Country,Other TotalCases NewCases TotalDeaths NewDeaths
99   Burkina Faso        484                   27          
147         Congo         60                    5          
166         Syria         25                    2          
    TotalRecovered ActiveCases Serious,Critical
99             155         302                 
147              5          50                 
166              5          18                 
    Tot Cases/1M pop Deaths/1M pop TotalTests Tests/1M pop
99                23             1                        
147               11           0.9                        
166                1           0.1                        
    Continent
99     Africa
147    Africa
166      Asia
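
Columns such as `NewCases` print as blanks above, which suggests they are character columns rather than numbers. A minimal sketch of converting them before further use; `to_number` is a hypothetical helper, and the assumption that larger values carry thousands separators or a leading `+` is not confirmed by the output above:

```r
# Hypothetical helper: strip everything except digits and the decimal point,
# then convert; empty strings become NA
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", as.character(x)))
}

# Illustrative values in the style of the table above
to_number(c("484", "", "1,234", "+27", "0.9"))
```

Applied to a scraped table, this would be something like `tod$TotalCases <- to_number(tod$TotalCases)`.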

ffinger commented 4 years ago

Hi @ColinFay, yes, the best option is to open a PR against NCoVUtils.

I haven't seen any sub-national data (by region, province or similar) on worldometers. Am I missing something?

I think it would still be a good addition to the existing functions that get national data from ECDC, WHO, JHU or similar, especially since it seems to include data on testing.

ColinFay commented 4 years ago

@ffinger not that I know of

ColinFay commented 4 years ago

Another possible source for Burkina Faso: https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19

Need to scrape the pdf(s)

ffinger commented 4 years ago

Thanks for this! Does anyone have time to implement the scraping?

ffinger commented 4 years ago

There is a figure here too, sources are probably the previously linked sitreps: https://fr.wikipedia.org/wiki/Pand%C3%A9mie_de_Covid-19_au_Burkina_Faso


It is probably possible to scrape, since the data seem to be in the source of the figure (click "modify code" to see it).

ColinFay commented 4 years ago

Here's the code to download all the pdfs:

dir.create("burkina_covid", showWarnings = FALSE)
base_url <- "https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19"
# The listing is paginated: the first page has no query string, then ?page=1 to ?page=3
pages <- c(base_url, paste0(base_url, "?page=", 1:3))
for (page in pages) {
  html <- xml2::read_html(page)
  # The download links sit in each document's dropdown menu
  links <- rvest::html_attr(rvest::html_nodes(html, ".dropdown-menu a"), "href")
  for (link in links) {
    # mode = "wb" keeps binary files (pdf, docx) intact on Windows
    download.file(
      link,
      file.path("burkina_covid", basename(link)),
      mode = "wb"
    )
  }
}

> fs::dir_tree("burkina_covid/")
burkina_covid/
├── covidresponseplanremarks-french.docx
├── ghrp-covid19-en.pdf
├── ghrp-covid19-fr.pdf
├── integration_du_covid-19_dans_la_reponse_humanitaire.pdf
├── plan_de_riposte_covid19-revise_def.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_centre_nord_fevrier_2020-1.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_centre_nord_fevrier_2020-1_1.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_sahel_fevrier_2020.pdf
├── sitrep_n27_du_24_03_20.pdf
├── sitrep_n_29_0.pdf
├── sitrep_n_32_covid-19_du_29_mars_2020_0.pdf
├── sitrep_n_33.pdf
├── sitrep_ndeg17_du_14_03_20.pdf
├── sitrep_ndeg21.pdf
├── sitrep_ndeg24_du_21_03_20.pdf
├── sitrep_ndeg25.pdf
├── sitrep_ndeg28.pdf
├── sitrep_ndeg35_0.pdf
├── sitrep_ndeg_20_du_17_03_20.pdf
├── sitrep_ndeg_22_du_19_03_20.pdf
├── sitrep_ndeg_26_du_23_03_20.pdf
├── sitrep_ndeg_31.pdf
├── sitrep_ndeg_34.pdf
├── sitrep_ndeg_36.pdf
├── sitrep_ndeg_37.pdf
├── sitrep_ndeg_38_covid_bfa_au_04_04_2020.pdf
├── sitrep_ndeg_39_1.pdf
├── sitrep_ndeg_40_0.pdf
├── sitrep_ndeg_41_au_7_avril_2020_1.pdf
├── sitrep_ndeg_42_covid-19_burkina_faso.pdf
├── sitrep_ndeg_43.pdf
└── sitrep_ndeg_44.pdf
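
The file names are not uniform (`sitrep_n27`, `sitrep_n_29`, `sitrep_ndeg17`, `sitrep_ndeg_44`), so picking the latest report needs a small parser. A sketch, assuming the report number is always the first run of digits after the `sitrep_n` or `sitrep_ndeg` prefix; that holds for the names listed above but may not hold for future uploads:

```r
# A few of the file names from the listing above
files <- c(
  "ghrp-covid19-en.pdf",
  "sitrep_n27_du_24_03_20.pdf",
  "sitrep_n_29_0.pdf",
  "sitrep_ndeg17_du_14_03_20.pdf",
  "sitrep_ndeg_38_covid_bfa_au_04_04_2020.pdf",
  "sitrep_ndeg_44.pdf"
)

# Keep only the situation reports
sitreps <- grep("^sitrep_n", files, value = TRUE)

# Capture the first run of digits after the "sitrep_n" / "sitrep_ndeg" prefix
sitrep_no <- as.integer(sub("^sitrep_n(deg)?_?([0-9]+).*$", "\\2", sitreps))

# Latest report = highest report number
latest <- sitreps[which.max(sitrep_no)]
latest
```

With the downloaded folder, `files` would come from `list.files("burkina_covid")` instead of being typed in.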

ColinFay commented 4 years ago

Here's a piece of code to extract data from the latest pdf:

library(tabulizer)
res <- tabulizer::extract_text("burkina_covid/sitrep_ndeg_44.pdf")
res <- strsplit(res, "\n")[[1]]
# Return the digits that follow ": " on every line of `res` matching `txt`
num_extr <- function(
  res, txt
){
  gsub(
    "[^:]*: ([0-9]*).*", 
    "\\1", 
    grep(txt, res, value = TRUE)
  )
}

cont <- c(
  "Cumul personnes contacts listées",
  "Contacts confirmés COVID-19 depuis le début", 
  "Nbre de contacts sortis de suivi ce jours", 
  "Cumul de contacts sortis après 14 jours de suivis", 
  "Nombre de contacts à suivre", 
  "Nombre de contacts vus", 
  "Nombre de contacts non vus", 
  "Nombre de contacts devenus suspects", 
  "Nombre de nouveaux contacts"
)

x <- sapply(
  cont, function(x){
    num_extr(res, x)
  }
) 

tibble::rownames_to_column(
  as.data.frame(x), 
  "type"
)
                                               type    x
1                  Cumul personnes contacts listées 2409
2       Contacts confirmés COVID-19 depuis le début  272
3         Nbre de contacts sortis de suivi ce jours   31
4 Cumul de contacts sortis après 14 jours de suivis 1076
5                       Nombre de contacts à suivre 1061
6                            Nombre de contacts vus    1
7                        Nombre de contacts non vus   19
8               Nombre de contacts devenus suspects   10
9                       Nombre de nouveaux contacts   99
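
To see what the regular expression in `num_extr()` does, here is the same function applied to a few made-up lines (the text and numbers are illustrative, not taken from a sitrep). Everything up to the first colon is dropped and the digits right after `: ` are kept:

```r
num_extr <- function(res, txt) {
  gsub("[^:]*: ([0-9]*).*", "\\1", grep(txt, res, value = TRUE))
}

# Made-up lines in the style of the extracted PDF text
res <- c(
  "Nombre de nouveaux contacts : 12 (dont 3 au Centre)",
  "Nombre de contacts vus: 450",
  "Une ligne sans rapport"
)

num_extr(res, "Nombre de nouveaux contacts")
num_extr(res, "Nombre de contacts vus")
```

The result is a character vector, so it needs `as.integer()` before any arithmetic. The pattern also stops at the first non-digit, so a value printed with a space as thousands separator, such as `1 042`, would come back as `1`.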

I'm French, so these seem to be the interesting parts to me, but as I'm no expert in the field, it would be great to have someone with domain knowledge point me to the relevant parts of the PDF.

PaulC91 commented 4 years ago

New confirmed cases and deaths by district would be great. But there doesn't seem to be any pattern in the way this information is given in the PDF (unlike the "suivi des contacts" (contact follow-up) section above), so I'm guessing it would be difficult to scrape consistently.

ColinFay commented 4 years ago

Here's an attempt at a package to download and scrape the data: https://github.com/ColinFay/covidbf

Let me know if you want me to work more on this.

ffinger commented 4 years ago

@ColinFay thanks a lot. As mentioned by @PaulC91, the information you are scraping comes from the contact tracing section of the reports. The new cases per region or district are buried in the text and, it seems, not consistently reported. There is also the map at the beginning that gives new cases by district, but I believe that is very hard to scrape.

ColinFay commented 4 years ago

Just to check, have you tried contacting the people listed at the bottom of the pdf? They might be willing to share the data

ColinFay commented 4 years ago

Oh and, what's LMIC?

xt-21 commented 4 years ago

Low- and middle-income country


ffinger commented 4 years ago

@ColinFay yes, we are in contact with authorities.

ffinger commented 4 years ago

I added some new countries and potential data sources to the spreadsheet.

See here for details: https://github.com/epiforecasts/NCoVUtils/issues/72#issuecomment-620715775