Closed nigiord closed 5 years ago
@nigiord Hi Nils, thanks once again for your message.
First, note that species heterogeneity is calculated before an extra refinement step that is applied after the refined GRiD value is computed. In your example, the refined GRiD value for genome NHFF01 was still greater than 10, so it was discarded and simply assigned a value of 1 (i.e. inactive). We deliberately leave these results in the output so that users can evaluate them for themselves. Similarly, GRiD calculations for genomes with coverage medians <0.15 are discarded, and those genomes simply get GRiD and unrefined GRiD values of 1.
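To make the rule described above concrete, here is a minimal Python sketch. It is not GRiD's actual code: the function name `report_grid`, its arguments, and the order of the checks are assumptions based only on the description in this comment, and the refined value of 1200.0 in the usage example is a hypothetical placeholder (the thread only says it was greater than 10).

```python
def report_grid(refined, unrefined, coverage_median,
                max_grid=10.0, min_coverage_median=0.15):
    """Return (GRiD, GRiD_unrefined) as they would be reported.

    Hypothetical helper illustrating the rule described in the thread:
    - coverage median below 0.15: both values are reported as 1
    - refined GRiD above 10: refined value is reported as 1 (inactive),
      while the unrefined value is left in the output
    """
    if coverage_median < min_coverage_median:
        return 1.0, 1.0
    if refined > max_grid:
        return 1.0, unrefined
    return refined, unrefined


# Hypothetical pre-filter refined value (> 10) for a genome like NHFF01.
print(report_grid(refined=1200.0, unrefined=1495.15, coverage_median=0.5))
```

Under these assumptions, a genome whose refined value stays above 10 ends up with GRiD = 1 while keeping its large unrefined value, which matches the pattern reported for NHFF01.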
Second, including those extra parameters will significantly increase the runtime since it involves multiple subsampling steps.
Also, how many contigs are in the genome in question? In multiplex mode, we use the -a option in bowtie to output multiple alignments per read so that they can be efficiently reassigned by pathoscope. The -a option isn't included in single mode. The differences in your results may be due to high contamination or fragmentation of your genome.
In summary, your results are not strange and are indeed expected.
Finally, regarding the additional enhancements for paired reads and the --overwrite option, that is something I could look into later on. At the moment, I am tied up with other projects for the next couple of weeks.
Cheers, Tunde
Hi @aemiol,
Thank you for your answer. It hadn't occurred to me that the filtered-out GRiD values were set to 1, so most of it makes sense now.
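Since discarded estimates stay in the output with a GRiD value of 1, one way to spot them afterwards is to look for rows where the reported GRiD is 1 but the unrefined value is above the cutoff of 10. The sketch below is only an illustration: the helper `looks_discarded`, the tuple layout, and the second genome are hypothetical, and a flagged row should be treated as a candidate for manual inspection rather than a confirmed discard.

```python
import math


def looks_discarded(grid, grid_unrefined, max_grid=10.0):
    """Heuristic: reported GRiD is 1 while the unrefined value exceeds the cutoff."""
    return math.isclose(grid, 1.0) and grid_unrefined > max_grid


# (genome, GRiD, GRiD_unrefined); only NHFF01's values come from this thread.
rows = [
    ("NHFF01", 1.0, 1495.15),
    ("hypothetical_genome", 1.6, 1.9),
]

flagged = [name for name, grid, unrefined in rows
           if looks_discarded(grid, unrefined)]
print(flagged)
```

This heuristic cannot identify genomes discarded for a low coverage median, because in that case both values are reported as 1.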
Also, how many contigs in that genome in question? In the multiplex mode, we use the -a option in bowtie to output multiple alignments per read in order to be efficiently reassigned by pathoscope. The -a option isnt included in the single mode. The differences in your results may be due to the high contamination or fragmentation of your genome
The genome in question has 45 contigs. I didn't use the -p option, so I do not think Pathoscope is involved. I guess there's not much to expect from a GRiD value > 1000 anyway, so the discrepancy can probably be explained by the high coverage variability for the contigs involved.
No problem for the additional enhancement, I'll make a pull request if I manage to craft something for my own use case.
Cheers, Nils
Hi and thank you again for providing GRiD to the community :),
I've been trying to use GRiD for a large scale analysis (thousands of genomes, hundreds of samples with expected high diversity). I first tested GRiD on a subset of my data (50 genomes, 25 samples). This is the procedure I followed so far (GRiD 1.3) :
update_database -d ./Index -g ./WGS -p "custom"
grid multiplex -r ./{sample} -e "fastq.gz" -o ./{sample}_output -d ./Index -c 1 -n 11
with {sample} the name of the sample.
As expected, I obtain two files for each sample: {sample}.GRiD.txt and {sample}.pdf. Below is an example of output for one of the samples (results are similar in other samples):
I have several questions regarding the values obtained.
1) Most of the refined GRiD values are set to 1. Is that normal/expected? What about the GRiD_unrefined value that is at 1495.15 and is simply reported as 1 after refinement? I would have expected the genome NHFF01 to be simply filtered out since "GRiD values greater than 10 are discarded as this may be due to high coverage of a contaminant contig".
2) From your publication, I've understood the procedure thanks to Supp. Fig. 1C, but I do not understand the rationale behind it, especially with regard to the refinements that I obtained above (how could I go from 1495.15 to 1?).
3) From your publication, I thought the species heterogeneity was computed as 1 - (GRiD_refined/GRiD_unrefined), but it does not seem to work for most of the values above (for instance, NHFF01 should have a species heterogeneity of 1 - (1/1495.15) = 0.9993 and not 0.4431). Could you explain how exactly the species heterogeneity is computed in GRiD?
3) By the way, what is the rationale behind this value? As I understand it, it kind of measures the degree of variability in the coverage trend. Why not just use a measure based on the variance of the residuals around the Tukey's biweight spline?
4) I noticed that the single module provides more outputs than the multiplex one. Are there any technical limitations that prevent computing confidence intervals, dna/ori, ter/dif and generating the coverage plot during the multiplex procedure? The mapping and parsing of the SAM files are really time-consuming, so I would prefer not to do it thousands of times.
5) Considering the case of NHFF01 above, I ran the single module on this genome and this sample in order to obtain the coverage plot. First, I noticed that the values reported, while still incoherent, change a lot. The coverage looks like this:
What do you think is going on here? While it is clear that this genome is of poor quality, I would like to be sure I can filter out such incoherent estimations afterwards. The fact that in the multiplex analysis it was reported as GRiD refined = 1 with a not-so-high species heterogeneity (0.44) makes me nervous about the future reliability of my results.
And finally some unrelated suggestions:
7) It seems to me that paired-end mapping is usually more specific (if you filter for properly paired reads afterwards). Would you be interested in a pull request that implements support for paired-end reads, or is there a technical limitation in GRiD that I am not aware of?
8) I would also suggest adding an --overwrite option that makes GRiD simply ignore whether the output directory is empty or not. GRiD could then be more easily integrated into a bigger pipeline (Snakemake, Nextflow, ...).
Again, nice job with GRiD, and thank you in advance for providing support for your potential users :).
Cheers, Nils
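For reference, the arithmetic in the first question 3 can be checked with a few lines of Python. The formula is the one assumed in the question, not necessarily what GRiD implements. The second part is purely a back-of-the-envelope assumption: if that formula were applied to the refined value before the final "greater than 10" filter, the reported heterogeneity of 0.4431 would imply a pre-filter refined value in the hundreds, which would be consistent with it being discarded.

```python
grid_unrefined = 1495.15

# Heterogeneity under the assumed formula, using the reported refined value of 1.
reported_refined = 1.0
heterogeneity = 1 - reported_refined / grid_unrefined
print(round(heterogeneity, 4))  # 0.9993

# Assumption: the same formula applied before the final filter.
# Solve 0.4431 = 1 - refined / unrefined for the refined value.
reported_heterogeneity = 0.4431
implied_refined = (1 - reported_heterogeneity) * grid_unrefined
print(round(implied_refined, 2))
```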