Which "data" folder, and what about ~2 million SARS-COV-2 genome sequences?

jielab commented 3 years ago

Hi, there:

I am trying to follow the GitHub instructions to test CovidGenotyper.

I am wondering whether I should create a "data" folder at the same level as CovidGenotyper or within it.

The following 3 lines mention the "data" folder in this order. The "mkdir data" command seems to be in the wrong place.

cp covid.fa data/ncov_ref_NC_045512.fasta
....
mkdir data; mv ncov_NC_045512_Genes.GFF3 data
....
saved as gisaid_cov2020sequences[mmm_dd].fasta in the data folder

BTW, I found that http://covidgenotyper.app is not available.

Just curious, can CovidGenotyper process ~2 million SARS-COV-2 genomes in the GISAID database now?

Thank you & best regards, Jie

hsmaan commented 3 years ago

Hi Jie,

I've made the corrections to the README - the data folder should be created at the top level of the CovidGenotyper directory.
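For anyone scripting the local install, the ordering can be sketched as follows: create `data` at the top level of the clone first, then copy files into it. This is only a minimal sketch in Python, not part of the README; the function name `prepare_data_dir` and its arguments are hypothetical, and the target file names are the ones quoted from the README earlier in this issue.

```python
import shutil
from pathlib import Path


def prepare_data_dir(repo_root, ref_fasta, genes_gff3):
    """Create <repo_root>/data, then copy the reference files into it.

    repo_root, ref_fasta and genes_gff3 are hypothetical arguments:
    the path to the CovidGenotyper clone and to the two downloaded files.
    """
    data_dir = Path(repo_root) / "data"
    # Create the folder before anything is copied into it.
    data_dir.mkdir(parents=True, exist_ok=True)
    # Target names follow the README lines quoted in this issue.
    shutil.copy(ref_fasta, data_dir / "ncov_ref_NC_045512.fasta")
    shutil.copy(genes_gff3, data_dir / "ncov_NC_045512_Genes.GFF3")
    return data_dir
```

Called as `prepare_data_dir("CovidGenotyper", "covid.fa", "ncov_NC_045512_Genes.GFF3")`, this would leave both files under `CovidGenotyper/data`.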

Please let me know if there are any other issues with the local install. Currently the website is not available as we are upgrading to a Python implementation. Unfortunately 2 million sequences will be too much for the application to handle, for both the backend processing and the visualization. I would suggest downsampling the data on GISAID in a stratified manner to get a good representation of the global data - up to 20,000 sequences can be supported by the visualization. We are looking to add support for more sequences in the Python implementation.
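One possible way to do the stratified downsampling is sketched below. It is an illustration under stated assumptions, not the method CGT uses: it assumes the metadata has already been loaded as a list of dicts, that the stratum is a single field (the name `country` in the usage line is hypothetical), and that an even round-robin across strata is an acceptable notion of "good representation".

```python
import random
from collections import defaultdict


def stratified_sample(records, key, max_total, seed=0):
    """Pick up to max_total records, spreading picks evenly across strata.

    records: list of dicts, one per sequence (assumed format).
    key: metadata field that defines the strata (hypothetical name).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    # Shuffle within each stratum so the picks are random, not file-ordered.
    for group in strata.values():
        rng.shuffle(group)

    groups = [strata[k] for k in sorted(strata)]
    picked = []
    depth = 0
    # Round-robin over strata: small strata are kept whole,
    # large strata are the ones that get cut back.
    while len(picked) < max_total:
        added = False
        for group in groups:
            if depth < len(group) and len(picked) < max_total:
                picked.append(group[depth])
                added = True
        if not added:
            break  # every stratum is exhausted
        depth += 1
    return picked
```

For example, `stratified_sample(rows, key="country", max_total=20000)` would return at most 20,000 rows. Stratifying on a combined field (such as region plus collection month) works the same way if that field is built beforehand.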

Thank you for your feedback on the application install. I will update you here on any progress on the Python implementation.

jielab commented 3 years ago

Dear Hassaan:

Thank you very much for updating the README file!

Maybe a couple of minor things could still be updated, for example:

After "unzip snpEff", there is a "snpEff" folder and a "clinEff" folder. Do we ignore the "clinEff" folder?
The documentation says to download the ncov_ref_NC_045512.fasta file, but that file already exists after I run "git clone". For small files, it would be nice if they came with the "git clone" package.

BTW, I do have a list of quick questions on the bioinformatics analysis of COVID data. I would deeply appreciate it if you could shed some light.

It says that "CGT relies on pre-processing plot data prior to deployment to ensure visualizations can be loaded quickly". I think the choice of which data to pre-process would affect the plots a lot. It would be good to give some recommendations here.
Although CGT is an R-Shiny based web application, I could not run it on Dreamhost.com, where I purchased an account, because I don't have root access. Do you have any recommendation on this?
Previously I have been working with human genome data. I always start with the FASTQ file, and then convert it to a BAM file. We now only have FASTA files from GISAID, no FASTQ files. Is it because the virus genome is small and very easy to sequence with high confidence, and therefore no quality measurement is needed?
I think GISAID used WIV04 as the reference for alignment, but CGT used Wuhan-Hu-1, which has 29,903 bases. Is there a reason for this? The published sequence of WIV04 (ncbi.nlm.nih.gov/nuccore/MN996528.1) ended with gene="N". So, where could I find the positions of ORF10 and the 3'UTR for this reference WIV04?
From GISAID, I assume that the most commonly downloaded data are the 3 MSA files under “Alignment and proteins”. I found that the FULL data has 36,801 letters for each genome. Why is it so much longer than the reference (N=29,891)? The Unmasked and Masked files both have 29,891 letters, but it seems that “-” and “n” differ between these two datasets. Does “-” mean a gap (after alignment) and “n” mean missing (not sequenced)? We all know face masks these days. What exactly is “masked” in the Masked file?
The FASTA file under the “Download packages” section has a total of 29,862 letters for each genome. Why is it missing 29 bases compared to the reference? Is this the unaligned raw file that Nextstrain and other alignment software usually use as input? I am also a bit puzzled that I could not find the sample EPI_ISL_426900 in this data, which is the first sample in the MSA files mentioned above.
Given the FASTA file of the reference genome, is there a simple and straightforward bioinformatics approach to identify what proteins are coded by this genome, and the start and end positions of each of the coded proteins? I thought that this would be very easy to do, based on the central dogma of molecular biology. However, I did not find an easy one-liner to do this. Must we go through some complicated BLAST process in order to find out what genes are coded by the SARS-COV-2 genome?
Now, everybody is talking about the Delta variant. The WHO has this page https://www.who.int/en/activities/tracking-SARS-CoV-2-variants. However, once I have SARS-COV-2 genome data, must I run a phylogeny analysis to get a “21A” from Nextstrain and a “B.1.617.2” from PANGO to declare a Delta variant? For example, does a Nextstrain clade of “21A” definitely mean “Delta”, and vice versa? Is there a one-to-one relationship between Delta and certain mutations? I had hoped to find this information in the GISAID metadata file that I downloaded from the “Download packages” section. However, I am very suspicious of the quality of the metadata file in this “Download packages” section. For example, the maximum value of “Sequence.length” in this file is 148,351, which does not seem to make sense.
Sometimes, a single mutation is enough to make a Delta variant. Other times, another Delta variant might have 10 mutations. Is there a mathematical formula or prediction algorithm to calculate the cumulative effect of 10 mutations vs. 1 mutation, for example?
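On the question of finding coding regions directly from a reference FASTA, the naive "central dogma" approach can be sketched as an open reading frame (ORF) scan: look for ATG, then read codons in frame until a stop codon. The sketch below is an illustration only, with known limits: it scans the forward strand only, the function name `find_orfs` and the `min_codons` threshold are hypothetical choices, it does not name any protein, and it cannot recover products that depend on a programmed ribosomal frameshift (as ORF1ab does in SARS-CoV-2). For gene names and exact coordinates, an annotation file such as the ncov_NC_045512_Genes.GFF3 mentioned in the README is the more reliable source.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 1-based and inclusive; the stop codon is included.
    An ORF here is simply: first ATG in a frame, up to the next in-frame stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start + 1, i + 3, frame))
                start = None  # close it and keep scanning this frame
    return sorted(orfs)
```

Running this on a reference sequence read from a FASTA file (header line removed, lines joined) gives candidate intervals that can then be compared against the GFF3 annotation.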

Thank you very much & best regards, Jie

hsmaan / CovidGenotyper

Which "data" folder, and what about ~2 million SARS-COV-2 genome sequences? #41