Scrape VOCs, etc. from more agencies

atc3 commented 3 years ago

For PHE, find the latest HTML report, follow that link, then scrape the table on the resulting page.
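A rough sketch of the table-scraping step, using only the Python standard library. It skips locating and downloading the latest report and starts from HTML that is already in hand. The sample markup and the column names (`Lineage`, `Designation`) are made up for illustration and may not match the actual PHE page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every table row as a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and normalize internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell texts."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample; the real report layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>Lineage</th><th>Designation</th></tr>
  <tr><td>B.1.1.7</td><td>VOC</td></tr>
  <tr><td>B.1.617.2</td><td>VOC</td></tr>
</table>
"""

records = parse_table(SAMPLE_HTML)
```

If the page holds more than one table, the parser would need to track which `<table>` it is in; this sketch assumes a single table.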

atc3 commented 3 years ago

Updated task:

Write a separate web scraper for each agency, each writing to its own JSON file. It is not crucial to scrape every metadata column; the only requirement is the pair of PANGO lineage + classification (e.g., B.1.1.7 + VOC).
Write a Snakemake rule to combine the results from the individual web scrapers. Aim for a compact CSV format where rows = PANGO lineages, columns = agencies, and cells = classification (VOC, VOI, etc.).
This CSV can be loaded into a pandas DataFrame and exported as web-friendly JSON with .to_json(orient='records'). This will produce an array of objects like {"lineage": "B.1.1.7", "WHO": "VOC", "CDC": "VOC", ...}
Add a separate Snakemake rule for pulling the PANGO lineage to WHO naming convention map, e.g., "B.1.617.2": "Delta".
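A minimal sketch of the combine step, assuming each scraper emits a list of `{"lineage": ..., "classification": ...}` records (that record shape, the function names, and the sample data are assumptions, not the project's actual format). It uses only the standard library; the resulting list of row objects has the same layout that `.to_json(orient='records')` would give for the equivalent DataFrame.

```python
import csv
import io
import json


def combine(agency_results):
    """Pivot per-agency records into one row per lineage, one column per agency."""
    agencies = sorted(agency_results)
    lineages = sorted(
        {rec["lineage"] for recs in agency_results.values() for rec in recs}
    )
    # Start every cell as None so lineages an agency does not list stay empty.
    table = {lineage: dict.fromkeys(agencies) for lineage in lineages}
    for agency, recs in agency_results.items():
        for rec in recs:
            table[rec["lineage"]][agency] = rec["classification"]
    return [{"lineage": lineage, **table[lineage]} for lineage in lineages]


def to_csv(rows):
    """Serialize the combined rows as CSV text (None becomes an empty cell)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative input only; in the pipeline this would be read from the
# individual scraper JSON files.
agency_results = {
    "WHO": [{"lineage": "B.1.1.7", "classification": "VOC"}],
    "CDC": [
        {"lineage": "B.1.1.7", "classification": "VOC"},
        {"lineage": "B.1.617.2", "classification": "VOC"},
    ],
}

rows = combine(agency_results)
csv_text = to_csv(rows)
json_text = json.dumps(rows)  # array of {"lineage": ..., "CDC": ..., "WHO": ...}
```

A Snakemake rule could call `combine` with the scraper outputs as its inputs and write `csv_text` as its output.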

atc3 commented 3 years ago

Additional cleaning:

If multiple lineages are defined in one entry, split them into separate entries
Remove additional mutations from the definition, e.g., B.1.1.7 + P384L becomes B.1.1.7
Collapse duplicate lineages by keeping the most severe classification, e.g., if one lineage is both VOC and VOI, drop the VOI entry and keep the VOC one.
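The three cleaning steps could look roughly like this. The separators used to detect multiple lineages (`,`, `/`, `;`), the `SEVERITY` ranking, and the record shape are all assumptions for the sketch; the sample rows are illustrative, not real agency data.

```python
import re

# Assumed ranking: a higher number means a more severe classification.
# Labels not listed here rank lowest.
SEVERITY = {"VOC": 2, "VOI": 1}


def clean(records):
    """Split multi-lineage entries, strip extra mutations, keep the most severe label."""
    best = {}
    for rec in records:
        # 1. Split entries that name several lineages (separators are assumed).
        for part in re.split(r"[,/;]", rec["lineage"]):
            # 2. Drop additional mutations: "B.1.1.7 + P384L" -> "B.1.1.7".
            lineage = part.split("+")[0].strip()
            if not lineage:
                continue
            # 3. Keep only the most severe classification per lineage.
            new = rec["classification"]
            current = best.get(lineage)
            if current is None or SEVERITY.get(new, 0) > SEVERITY.get(current, 0):
                best[lineage] = new
    return [{"lineage": l, "classification": c} for l, c in best.items()]


raw = [
    {"lineage": "B.1.1.7 + P384L", "classification": "VOI"},
    {"lineage": "B.1.1.7", "classification": "VOC"},
    {"lineage": "B.1.427/B.1.429", "classification": "VOI"},
]

cleaned = clean(raw)
```

Here the two B.1.1.7 rows collapse into a single VOC entry, and the combined B.1.427/B.1.429 row becomes two VOI entries.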

atc3 commented 3 years ago

Closed by #357

vector-engineering / covidcg