nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License

Question about mutations in mut_nt.json and mut_aa.json #357

Closed cornhundred closed 4 years ago

cornhundred commented 4 years ago

Are the mutations listed for each strain (e.g. in nt_muts.json) the only mutations for that strain, or are we to infer that it also has the mutations inherited from its clade (e.g. what's described with the branching of strains here: https://nextstrain.org/help/general/how-to-read-a-tree)? I'm just surprised that the mutations are so sparse.

I normally work with gene expression data and this is my first experience with virus mutation data. Thanks for sharing ncov - it was very easy to get running.

cornhundred commented 4 years ago
[Image: Screen Shot 2020-04-13 at 9 26 59 AM]

Here's a quick attempt to visualize this (low-res heatmap image), with mutations as rows and strains as columns (with clade categories), showing the sparseness.

emmahodcroft commented 4 years ago

Hi @cornhundred, thanks for reaching out! Yes, the mutations listed are both the mutations for that node/tip, but also those the tip inherits from the internal nodes which lead to it (from root to tip). So, the tree should be 'traversed' to count up the total number of mutations at the tip. I hope that is helpful!

However, it is worth keeping in mind that this virus is not very diverse yet, so the mutations will remain fairly sparse, at least compared to many other viruses!

cornhundred commented 4 years ago

Thanks @emmahodcroft! So just to clarify, are the mutations in the nt_muts.json

 "nodes": {
    "some-node": {
      "muts": [
        "mutation-1",
        "mutation-2",
        "mutation-3"
      ],
    ...

a comprehensive list of mutations for this node, such that I don't have to traverse the tree to add the inherited mutations?

emmahodcroft commented 4 years ago

Ah, sorry - I wasn't clear in my answer. No, you will have to traverse the tree to get all the mutations. The ones at the tips/nodes are only the ones unique to that tip. The others that are shared with other tips will be listed on the internal nodes. So yes, you'll have to traverse the tree to get the full list. Sorry I wasn't clear enough the first time!
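For readers following along, the traversal described above might look roughly like the sketch below. It assumes an Auspice v2-style nested tree in which each node carries its own branch mutations under `branch_attrs.mutations.nuc` and mutation strings look like `C241T`; the toy tree, strain names, and function names are made up for illustration and are not part of any Nextstrain tool. A mutation that returns a site to its root state is treated as a reversion and dropped.

```python
import re

# Assumed mutation format: root/old base, 1-based position, new base (e.g. "C241T").
MUT_RE = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def apply_mutations(state, muts):
    """Return a copy of state (position -> (root_base, current_base)) with muts applied."""
    new_state = dict(state)
    for mut in muts:
        match = MUT_RE.match(mut)
        if match is None:
            raise ValueError(f"unrecognised mutation string: {mut!r}")
        old, pos, new = match.group(1), int(match.group(2)), match.group(3)
        # The root base is whatever the site held before its first change on this path.
        root_base = new_state.get(pos, (old, old))[0]
        if new == root_base:
            new_state.pop(pos, None)  # reversion: the site matches the root again
        else:
            new_state[pos] = (root_base, new)
    return new_state


def collect_tip_mutations(root):
    """Walk root to tips and return {tip name: mutations relative to the root}."""
    result = {}
    stack = [(root, {})]  # explicit stack avoids recursion limits on deep trees
    while stack:
        node, inherited = stack.pop()
        muts = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
        state = apply_mutations(inherited, muts)
        children = node.get("children", [])
        if not children:
            result[node["name"]] = [
                f"{root_base}{pos}{cur}" for pos, (root_base, cur) in sorted(state.items())
            ]
        stack.extend((child, state) for child in children)
    return result


# Hypothetical toy tree; a real one would come from json.load(...)["tree"].
toy_tree = {
    "name": "NODE_0000001",
    "branch_attrs": {"mutations": {"nuc": ["C241T"]}},
    "children": [
        {"name": "strain_A", "branch_attrs": {"mutations": {"nuc": ["A100G"]}}},
        {
            "name": "NODE_0000002",
            "branch_attrs": {"mutations": {"nuc": ["G300A"]}},
            "children": [
                {"name": "strain_B", "branch_attrs": {"mutations": {}}},
                {"name": "strain_C", "branch_attrs": {"mutations": {"nuc": ["T241C"]}}},
            ],
        },
    ],
}

tip_mutations = collect_tip_mutations(toy_tree)
for tip, muts in sorted(tip_mutations.items()):
    print(tip, muts)
```

In this toy example strain_C inherits C241T from the root branch and then reverts it, so only G300A remains in its final list.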

cornhundred commented 4 years ago

@emmahodcroft thanks again. So, it sounds like I can parse the Newick tree and assemble all the acquired mutations for each strain. I'll probably try assembling from the Auspice JSON (https://nextstrain.org/docs/bioinformatics/data-formats) first.

Another quick question: I'm seeing ~6,000 strains from GISAID for COVID, but Nextstrain has only ~3,500 strains. Are you all performing extra filtering? I have to read your documentation more carefully, so please excuse the simple questions.

emmahodcroft commented 4 years ago

@cornhundred Glad this helped! Yes, you should be able to walk through the Auspice JSON. Alternatively, you can match up the tree output from refine with the nt_muts.json, and this'll get you the same results!
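As a rough illustration of the second route (pairing the refined tree with nt_muts.json), the sketch below builds a child-to-parent map from a Newick string and then concatenates each node's `muts` list from root to tip. The tiny parser is an assumption-laden stand-in for a real Newick library: it expects every node, including internal ones, to be labelled, and it ignores quoted labels and comments. The Newick string, node names, and function names are hypothetical; only the `nodes` / `muts` layout is taken from the snippet earlier in this thread.

```python
import re

TOKEN_RE = re.compile(r"[(),;]|[^(),;]+")


def newick_parents(newick):
    """Map each node name to its parent's name (the root maps to None)."""
    parents = {}
    open_groups = []        # one list of child names per unclosed "("
    closed_children = None  # children of the group just closed by ")"
    for token in TOKEN_RE.findall(newick.strip()):
        if token == "(":
            open_groups.append([])
        elif token == ")":
            closed_children = open_groups.pop()
        elif token in ",;":
            continue
        else:
            name = token.split(":")[0].strip()  # drop the branch length
            # A label right after ")" names the parent of that closed group.
            for child in closed_children or []:
                parents[child] = name
            closed_children = None
            if open_groups:
                open_groups[-1].append(name)
            else:
                parents[name] = None  # nothing encloses it, so it is the root
    return parents


def mutations_root_to_tip(tip, parents, node_data):
    """Concatenate the per-node 'muts' lists along the path from root to tip."""
    path = []
    node = tip
    while node is not None:
        path.append(node)
        node = parents[node]
    muts = []
    for name in reversed(path):
        muts.extend(node_data["nodes"][name].get("muts", []))
    return muts


# Hypothetical inputs standing in for the refine tree and nt_muts.json.
newick = "(strain_A:0.1,(strain_B:0.2,strain_C:0.3)NODE_0000002:0.1)NODE_0000001:0;"
nt_muts = {
    "nodes": {
        "NODE_0000001": {"muts": ["C241T"]},
        "NODE_0000002": {"muts": ["G300A"]},
        "strain_A": {"muts": ["A100G"]},
        "strain_B": {"muts": []},
        "strain_C": {"muts": ["T241C"]},
    }
}

parents = newick_parents(newick)
print(mutations_root_to_tip("strain_C", parents, nt_muts))
```

This version keeps every mutation on the path in order, so a change followed by its reversion appears as two entries rather than cancelling out.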

Yes, in our global build we are subsampling to 150 samples per division, per month, per year. So, that cuts our total number down. We are thinking about ways to have a completely 'full' tree available online. In the meantime, if you have the compute power, you should be able to run this 'full' tree using snakemake all (not including the -s Snakefile_Regions). It depends on your computer, but this could take quite a while (hours) to run - I'd recommend using a cluster if you can!
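To make the subsampling rule concrete, here is a small sketch of capping the number of samples per (division, year, month) group. It is only an illustration of the idea, not the build's actual implementation: the record fields (`strain`, `division`, `date`), the random selection, and the function name are assumptions, and dates are assumed to be complete `YYYY-MM-DD` strings.

```python
import random
from collections import defaultdict


def subsample(records, per_group=150, seed=0):
    """Keep at most per_group records for each (division, year, month) group."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    groups = defaultdict(list)
    for rec in records:
        year, month = rec["date"].split("-")[:2]
        groups[(rec["division"], year, month)].append(rec)
    kept = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > per_group:
            members = rng.sample(members, per_group)
        kept.extend(members)
    return kept


# Hypothetical metadata: 5 + 2 + 1 records across three groups.
records = (
    [{"strain": f"wa_mar_{i}", "division": "Washington", "date": f"2020-03-{i + 1:02d}"} for i in range(5)]
    + [{"strain": f"wa_apr_{i}", "division": "Washington", "date": f"2020-04-{i + 1:02d}"} for i in range(2)]
    + [{"strain": "zh_mar_0", "division": "Zurich", "date": "2020-03-15"}]
)

kept = subsample(records, per_group=2)
print(len(records), "->", len(kept))
```

With a cap of 2 per group, the five Washington samples from March are cut to two while the smaller groups are kept whole.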

cornhundred commented 4 years ago

Thanks @emmahodcroft.

One last mutation-related question about sharing. Later we might want to share a visualization (something like https://github.com/cornhundred/citibike-clustergrammer2) of this mutation data and be in compliance with the GISAID sharing policy. So we're thinking we might not show actual mutations across strains and instead just rename them or something. I know that your Nextstrain shows mutations on mouseover so I just wanted to get your group's opinion - we will probably check with GISAID before sharing anything.

nahid18 commented 3 years ago

> you will have to traverse the tree to get all the mutations

Hi @emmahodcroft, I am working on this but is there any script available from nextstrain that does this already? If available, then I won't need to do it by myself, and will use the ready-made one.

Thanks in advance.