vector-engineering / covidcg

A COVID-19 CoV Genetics (CG) browser to inform therapeutics development
https://covidcg.org
MIT License

Download SNV frequency table and lineage table filtered by location/date range #437

Open rosekantor opened 2 years ago

rosekantor commented 2 years ago

Hello,

This site is my go-to resource for identifying lineages based on mutations in wastewater, and I've been sharing it with colleagues who really appreciate it, too. I am now looking specifically for mutations common in California within a date range (for example, to check my primers against the prevalent sequences, or to know which SNVs I might expect to find in specific wastewater samples).

There are two tables I'm interested in being able to download:

  1. A table of the data shown in the second bar from the left on the locations page (says "AA SNVs" or "NT SNVs" on the top, depending on the applied filters) - see screenshot below.
  2. A filtered version of the full lineage table obtained by clicking "download" > "consensus mutation". Here, the applied filters do not appear to have any effect; perhaps this should be a separate issue.

Thanks in advance,

Rose

[Screenshot: Screen Shot 2021-11-11 at 3 56 51 PM]

atc3 commented 2 years ago

Hi Rose,

> A table of the data shown in the second bar from the left on the locations page (says "AA SNVs" or "NT SNVs" on the top, depending on the applied filters) - see screenshot below.

To get the data for the legend, you can select "Download Aggregate Data" (see picture below).

This results in a CSV file in which each unique combination of mutations is aggregated into one row. The mutations are in the form pos|ref|alt and are delimited by semicolons. To get the frequencies of single mutations, you'll have to split that mutation string and count each mutation as you go through the rows.

I understand this is a bit of work, so I'll make a download item that splits this up and collapses by date like you requested.
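In the meantime, the splitting and counting described above could be sketched roughly as follows. This is only an illustration: the column names `mutations` and `counts` are assumptions (the actual header of the "Download Aggregate Data" CSV isn't shown in this thread), and the sample rows are made up, so adjust both to match the real file.

```python
import csv
import io
from collections import Counter


def count_single_mutations(csv_text, mutation_col="mutations", count_col="counts"):
    """Tally sequences per individual mutation from an aggregated CSV.

    Both column names are hypothetical placeholders; check the real header.
    Returns (per-mutation totals, total number of sequences).
    """
    totals = Counter()
    n_sequences = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        n = int(row[count_col])
        n_sequences += n
        # Split "pos|ref|alt;pos|ref|alt" into single mutations.
        # A set guards against a mutation being listed twice in one row;
        # empty strings (rows with no mutations) are skipped.
        mutations = {m.strip() for m in row[mutation_col].split(";") if m.strip()}
        for mutation in mutations:
            totals[mutation] += n
    return totals, n_sequences


# Made-up example rows, for illustration only.
sample = """mutations,counts
241|C|T;23403|A|G,10
23403|A|G,5
,3
"""

totals, n_sequences = count_single_mutations(sample)
frequencies = {m: c / n_sequences for m, c in totals.items()}
for mutation, freq in sorted(frequencies.items()):
    print(mutation, round(freq, 3))
```

If the file also carries location or date columns, the same loop can key the counter on a tuple such as (date, mutation) to get per-date counts.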

> A filtered version of the full lineage table obtained by clicking "download" > "consensus mutation". Here, the applied filters do not appear to have any effect; perhaps this should be a separate issue.

You're correct - the consensus mutations are calculated across the entire dataset and are not computed based on the user's selections.

I can add a checkbox into the "download" -> "consensus mutation" dialog that specifies whether to use the whole dataset or just the sequences from the user selection.

We're currently refactoring some of the core components of the site, so these changes can't be implemented immediately; it may take a week or two. I'll let you know when it's live.

Albert

atc3 commented 2 years ago

Hi Rose,

Apologies for the late reply to this.

1) I've added a download for this data. It's named "Group Counts" and is available under the download button.

2) I added another download endpoint for consensus mutations that also filters on date ranges, locations, etc. Right now it's not linked to any part of the site (still figuring that out), but it's available as an API endpoint. I've described it here: https://github.com/vector-engineering/covidcg/blob/master/API.md#dynamic-group-mutation-frequencies