Strange results during the refinement step

nigiord commented 5 years ago

Hi and thank you again for providing GRiD to the community :),

I've been trying to use GRiD for a large scale analysis (thousands of genomes, hundreds of samples with expected high diversity). I first tested GRiD on a subset of my data (50 genomes, 25 samples). This is the procedure I followed so far (GRiD 1.3) :

create a custom database with update_database
update_database -d ./Index -g ./WGS -p "custom"
run grid multiplex on the database for each sample (only pair 1 of the paired-end reads)
grid multiplex -r ./{sample} -e "fastq.gz" -o ./{sample}_output -d ./Index -c 1 -n 11 with {sample} the name of the sample

As expected I obtain two files for each sample: {sample}.GRiD.txt and {sample}.pdf. Below is an example of output for one of the sample (results are similar in other samples):

$ column -t sample1/sample1.GRiD.txt
Genome               GRiD  GRiD_unrefined  Species_heterogeneity  Coverage
1007115_PRJNA195660  1     1               0                      3.449
1009708_PRJNA195655  1.26  2.19            0.424657534246575      41.054
1096769_PRJNA74601   1     1.4             0.285714285714286      163.665
110662_PRJNA13643    1.94  3.97            0.511335012594458      4.465
1265756_PRJNA182740  1     1.67            0.401197604790419      143.907
146891_PRJNA13548    1.26  2.37            0.468354430379747      8.136
166314_PRJNA37911    2.85  5.55            0.486486486486486      5.145
167546_PRJNA15746    1.79  3.22            0.444099378881988      8.259
167555_PRJNA15660    1     1.13            0.115044247787611      6.216
335992_PRJNA13989    1     1.84            0.456521739130435      74.682
439493_PRJNA19339    1     1.25            0.2                    214.679
466038_PRJNA182738   1     1.32            0.242424242424242      140.408
59919_PRJNA213       1.25  1.54            0.188311688311688      115.788
744985_PRJNA46519    1     1.22            0.180327868852459      48.016
74546_PRJNA13910     1.59  3.09            0.485436893203883      8.533
93060_PRJNA18633     1.39  2.5             0.444                  7.358
939306_PRJNA77845    1     3.39            0.705014749262537      16.528
DCWU01               1     1.84            0.456521739130435      3.275
NHEH01               1     1.15            0.130434782608696      6.741
NHFF01               1     1495.15         0.443186302377688      11.745
NZHJ01               1     1.26            0.206349206349206      15.918
NZHY01               1     1.14            0.12280701754386       5.577
NZYN01               1     2.39            0.581589958158996      3.782
PADK01               1     1.22            0.180327868852459      1.78
PAIJ01               1     1.18            0.152542372881356      21.636
PBPY01               1     1.15            0.130434782608696      3.077
PBVP01               1     1.17            0.145299145299145      2.027
PBVW01               1     1.37            0.27007299270073       2.049

I have several questions regarding the values obtained.

1) Most of the refined GRiD values are set to 1. Is that normal/expected? What about the GRiD_unrefined value that is at 1495.15 and is simply reported as 1 after refinement? I would have expected the genome NHFF01 to be simply filtered out since "GRiD values greater than 10 are discarded as this may be due to high coverage of a contaminant contig". 2) From your publication,

To circumvent these limitations, GRiD incorporates a refinement step. Here, for the peak coverage value, the lowest point of expected variation of the mean value is chosen. Likewise, the upper point of variance of the trough mean is selected for the trough value (Supplementary Fig. 1C). The resulting growth value is called “GRiD_refined”.

I've understood the procedure thanks to Supp. Fig. 1C, but I do not understand the rational behind it. Especially in regard with the refinements that I obtained above (How could I go from 1495.15 to 1?).

3) From your publication, I thought the species heterogeneity was computed as 1 - (GRiD_refined/GRiD_unrefined) but it does not seem to work for most of the values above (for instance, NHFF01 should have a species heterogeneity of 1 - (1/1495.15) = 0.9993 and not 0.4431). Could you explain how exactly the species heterogeneity is computed in GRiD? 3) By the way what is the rational behind this value? As I understand it kind of measures the degree of variability in the coverage trend. Why not just use a measure based on the variance of the residuals around the Tukey's biweight spline? 4) I noticed that the single module provides more outputs than the multiplex one. Are there any technical limitations that prevent to compute confidence intervals, dna/ori, ter/dif and generate the coverage plot during the multiplex procedure? The mapping / parsing of the SAM are really time expansive so I would prefer not to do it thousands of time. 5) Considering the case of NHFF01 above, I ran the single module on this genome and this sample, in order to obtain the coverage plot. First, I noticed that the values reported, while still incoherent, change a lot.

Sample   GRiD  95% CI         GRiD unrefined  Species heterogeneity  Coverage  dnaA/ori ratio      ter/dif ratio
sample1  22.2  -5.08 - 21.92  94.07           0.764005527798448      11.149    0.0599682765321844  0.480600804084992

The coverage looks like this : Capture d’écran_2019-04-09_18-28-23

What do you think is going on here? While it is clear that this genome is poor quality, I would like to be sure to be able to filter out such incoherent estimations afterwards. The fact that in the multiplex analysis it was reported as GRiD refined = 1 with a not-so-high species heterogeneity (0.44) makes me nervous for the future reliability of my results.

And finally some unrelated suggestions:

7) It seems to me that paired-end mapping is usually more specific (if you filter for properly paired reads afterwards). Would you be interested for a pull request that implement support for paired-end reads or is there a technical limitation in GRiD that I am not aware of? 8) I would also suggest to add an option --overwrite that makes GRiD simply ignore if the output directory is empty or not. GRiD could then more easily be implemented inside a bigger pipeline (Snakemake, Nextflow, ...).

Again nice job with GRiD and thank you in advance for providing support for your potential users :) .

Cheers, Nils

aemiol commented 5 years ago

@nigiord Hi Nils, thanks once again for your message.

First thing to note, species heterogeneity is calculated before an extra layer of refinement step after GRiD refined is calculated. Clearly in your example, the refined GRiD value for genome NHFF01 was still greater than 10 and thus discarded and simply assigned a value of 1 (i.e. inactive). We deliberately leave these results in the output so the user can evaluate for themselves. Similarly, GRiD calculations for genomes with coverage medians <0.15 are discarded and would simply have GRiD and unrefined GRiD values of 1.

Second, including those extra parameters will significantly increase the runtime since it involves multiple subsampling steps.

Also, how many contigs in that genome in question? In the multiplex mode, we use the -a option in bowtie to output multiple alignments per read in order to be efficiently reassigned by pathoscope. The -a option isnt included in the single mode. The differences in your results may be due to the high contamination or fragmentation of your genome

In summary, the output of your results are not strange and are indeed expected.

Finally, regarding additional enhancement for paired-reads and --overwrite option, that could be something i could look into later on. At the moment, I am tied with other projects for the next couple of weeks.

Cheers, Tunde

nigiord commented 5 years ago

Hi @aemiol,

Thank you for your answer. It didn't occurred to me the filtered out GRiD values were set at 1, so most of it makes sense now.

Also, how many contigs in that genome in question? In the multiplex mode, we use the -a option in bowtie to output multiple alignments per read in order to be efficiently reassigned by pathoscope. The -a option isnt included in the single mode. The differences in your results may be due to the high contamination or fragmentation of your genome

The genome in question has 45 contigs. I didn't used the -p option so I do not think Pathoscope is involved. I guess there's not much to expect from a GRiD value > 1000 anyway, so the discrepancy can probably be explained by the high coverage variability for the contigs involved.

No problem for the additional enhancement, I'll make a pull request if I manage to craft something for my own use case.

Cheers, Nils

ohlab / GRiD

Strange results during the refinement step #8