nextstrain / ncov

Nextstrain build for novel coronavirus SARS-CoV-2
https://nextstrain.org/ncov
MIT License

4 Philosophical and Scientific questions on nextStrain/ncov #340

Closed jielab closed 4 years ago

jielab commented 4 years ago

Hi, there:

Deeply appreciate again that you guys have developed a tremendously useful and powerful tool!

I now have a few philosophical and scientific questions. I hope some of you can kindly shed some light.

  1. Researchers around the world have used coronavirus phylogenetic analysis results to publish many papers, many of them with fewer than 100 samples. I am very surprised that Nextstrain does not seem to have published papers, given that it presents phylogenetic analyses of thousands of coronavirus genomes. Is there a reason for this?

  2. Many published papers seem to use tools such as MEGA. I found that MEGA could handle only tens of virus genomes on my laptop. So, is there still a need for MEGA if I just want to do phylogenetic analysis? I assume that MEGA can do other analyses, such as amino-acid alignments, which cannot be done by Nextstrain.

  3. When I click the “CLOCK” link on nextstrain.org/ncov, it says “real estimates: 23.89 subs per year”. It has been only 4 months since December 2019, i.e. 1/3 of a full year. So, are there only ~8 subs (i.e. mutations) among all the ~3,000 coronavirus genomes analyzed by Nextstrain so far? My understanding must be wrong; otherwise, 8 mutations should not be enough to draw such a complex phylogenetic tree. On the other hand, I did see that the Y-axis label says “mutation” and the maximum value on the current plot's Y axis is 20. So, is there only a maximum of 20 mutations among ~3,000 genomes so far? I am a bit confused here…

  4. On this page https://nextstrain.org/docs/getting-started/local-installation, under “Table of Contents”, there are 6 items. I initially thought that I needed to follow all 6 steps one by one. It turns out that I only need to do the first step, that is, Install Augur & Auspice with Conda (recommended). So, what is the difference between the regular Anaconda and this Miniconda? Also, I am a bit confused by the fact that you have the nextstrain.org website, which explains how to install Auspice and Augur, and also github.com/nextstrain, which includes information on Auspice and Augur as well. So, what exactly is the division of information between nextstrain.org and github.com/nextstrain?

Thank you very much & Best regards, Jie

jameshadfield commented 4 years ago

Hi Jie, thanks for reaching out :)

(1) We believe that distributing continually up-to-date analysis through nextstrain.org, using data shared through GISAID, is a valuable and important contribution to scientific understanding. We do not see this as a replacement for formal publications. Posts on virological.org are similar in this respect.

(2) Our ncov analysis pipeline (contained in this repo) does not use it, but MEGA has been (and continues to be) important for many bioinformatics use cases. Nextstrain (via Augur) does not currently do amino-acid alignments.

(3) That estimate (~24 subs/yr) is based on a root-to-tip analysis of the data, as generated under the temporal model implemented in TreeTime and run via this repo. You can see from the clock layout that there is variability in the clock signal within the data. It is not unexpected that some genomes have 15-20 SNPs relative to the root. (There are other possibilities which should be considered, such as sequencing errors, but we try to identify these and exclude them if possible.)

(4) We provide a number of different ways to install the software which, taken together, makes up Nextstrain. All the code for our software (and analysis pipelines) is open-source and available as GitHub repos within github.com/nextstrain. You can see an overview of the different parts of Nextstrain here: https://nextstrain.org/docs/getting-started/introduction#open-source-tools-for-the-community
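
To make point (3) more concrete, here is a minimal sketch of a root-to-tip regression in plain Python. It is only an illustration of the idea: it fits an ordinary least-squares line through (sampling date, divergence) pairs, whereas TreeTime's actual temporal model is more involved. The sample dates, the 24 subs/year rate, the root date, and the function names `expected_divergence` and `fit_clock` are all hypothetical values and helpers made up for this example, not real Nextstrain data or API.

```python
def expected_divergence(rate_per_year, years_elapsed):
    """Expected substitutions accumulated by ONE lineage since the root."""
    return rate_per_year * years_elapsed


def fit_clock(dates, divergences):
    """Least-squares fit of divergence = rate * (date - root_date).

    dates: sampling dates as decimal years (e.g. 2020.25)
    divergences: substitutions relative to the root for each sample
    Returns (rate in subs/year, inferred root date).
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    rate = sxy / sxx                    # slope of the regression line
    intercept = mean_d - rate * mean_t  # divergence at date 0
    root_date = -intercept / rate       # date at which divergence is 0
    return rate, root_date


# Hypothetical, noise-free samples generated from an assumed clock
# (assumption: 24 subs/year, root in late 2019). Real data scatter
# around the line, which is the variability visible in the clock view.
TRUE_RATE = 24.0
ROOT_DATE = 2019.95
sample_dates = [2020.00, 2020.05, 2020.10, 2020.15, 2020.20, 2020.25]
sample_divergences = [TRUE_RATE * (t - ROOT_DATE) for t in sample_dates]

rate, root_date = fit_clock(sample_dates, sample_divergences)
print(f"fitted rate: {rate:.2f} subs/year, inferred root date: {root_date:.2f}")
print(f"expected divergence after 4 months: {expected_divergence(24.0, 4 / 12):.1f}")
```

The roughly 8 substitutions obtained from 24 subs/year over 4 months is the expected distance of a single genome from the root, not the total number of distinct mutations in the dataset. Different lineages accumulate different substitutions, so thousands of genomes can collectively carry many more mutations than any one of them, which is why the tree can be complex.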

jielab commented 4 years ago

Hi, guys:

Thank you so much again for replying to my previous questions. I am still trying to figure out how to correctly read some key information from Nextstrain. It would be really nice if you had a tutorial on how to read the plots, with a FAQ to address questions like the following:

  1. When I open the homepage and click the “Clock” button, the Y-axis label says “mutations”. The dot on the top right says “Divergence 24.16; Date: 2020-03-27”. Does “divergence” here mean the same thing as “mutations”? If so, how could a virus genome have 24.16 mutations? The Y-axis label did say “mutations”.

  2. I guess the date on the X axis is the reported date, not a genetically inferred date. Otherwise, the virus with the biggest divergence should have the latest date, correct?

  3. The top of the plot shows “rate estimate: 26.657 subs per year”. Does “sub” here mean “mutation” too?

  4. For the “Map options” select button on the left, no matter which option I select, the plot stays the same.

  5. It says that “site numbering and genome structure uses Wuhan-Hu-1/2019 as reference. The phylogeny is rooted relative to early samples from Wuhan”. Can I still use Nextstrain if my own genome data does not include a root genome like Wuhan-Hu-1/2019, so that the tree would not have a root at all?

  6. It says that “the JSON tree data underlying this visualization is available at …”. So, I guess the JSON file is all that is needed to get the visualization. Let’s say that I have more samples than Nextstrain can handle in one run. Then I might need to split the samples into two separate datasets and run them separately. Is there a way for me to merge the two JSON files into one that can generate a single visualization?

  7. Right now there is a drop-down menu to visualize a subset of samples by virus category and by geographic region. This is very nice. However, is there a more flexible way to visualize only a single country or my own list of samples? For example, one might want to display the phylogenetic tree for 10 particular genomes.

  8. Is the paper “Nextstrain: real-time tracking of pathogen evolution”, published in Bioinformatics, the one to cite if I use Nextstrain in a publication?

  9. I am still amazed by the fact that I could install Nextstrain locally on my not-so-powerful laptop and then generate a phylogenetic tree for thousands of samples. My laptop could not even deal with 100 genomes when I used MEGA-X. So, what technology made the difference?

  10. Currently 3251 genomes are displayed. Is there a way to find out how many mutations, unique haplotypes, and distinct haplogroups there are? I guess it is likely that some viruses have genome sequences 100% identical to others. I don’t know how you define a haplotype and a haplogroup here. This information is key for us to understand how fast the virus mutates.

I realize that this list of questions is a bit long. If you could only answer one question, can you please kindly answer the last question shown in bold text? This one is very important for me to grab the essence of nextstrain. I assume that other users might want to know this as well.

Thank you guys so much!

God bless the world!

Best regards,

Jie

From: james hadfield notifications@github.com Sent: April 8, 2020, 11:01 To: nextstrain/ncov ncov@noreply.github.com Cc: jiehuang001 jiehuang001@gmail.com; Author author@noreply.github.com Subject: Re: [nextstrain/ncov] 4 Philosophical and Scientific questions on nextStrain/ncov (#340)

