mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.
GNU General Public License v3.0
5 stars 2 forks source link

Investigate R28980 & R20574 sample swap #68

Closed mbhall88 closed 3 years ago

mbhall88 commented 3 years ago

This pair has a COMPASS distance of 0 and a bcftools distance of 27. It is the cause of the only FN for SNP thresholds 0, 2 & 5 and one of the two causes for SNP threshold 12.

mbhall88 commented 3 years ago

R20574 seems to be the cause of this discrepancy. The reason for this is that R20260 has total agreement between COMPASS and bcftools at these 27 sites, but R20574 has differences between COMPASS and bcftools at all of these sites.

The sample has high depth at all of these positions, shows no signs of HETs, shows no signs of strand bias, The lineage call for the nanopore data is also the same as the Illumina data.

I'm at a bit of a loss as to what else to do other than just :shrug:

mbhall88 commented 3 years ago

For future reference, the positions where the 27 differences occur is

[7581, 131114, 143463, 637419, 737014, 761109, 761154, 828219, 886669, 1055774, 1424975, 1471473, 1473245, 1673422, 2205857, 2215433, 2287683, 2289201, 2322806, 2333097, 2605149, 2976245, 3204704, 3511367, 3665012, 4247430, 4248001]

The bcftools VCF is on noah at /hps/nobackup/research/zi/projects/tech_wars/analysis/baseline_variants/nanopore/filtered_snps/south_africa/R20574.snps.filtered.bcf
The COMPASS VCF is at /hps/nobackup/research/zi/projects/tech_wars/analysis/baseline_variants/illumina/gvcfs/R20574.compass.vcf.gz

iqbal-lab commented 3 years ago

How about reopen this and transfer ownership to me and Martin for a couple of days, we can look on mon/tues. Second/third pair of eyes

iqbal-lab commented 3 years ago

Any chance you couldput paths to bams here, so we can look at the pileups?

mbhall88 commented 3 years ago

That would be very much appreciated.
I don't have BAMs (I delete them after I have created the pileup). But I have a pileup from bcftools mpileup using option --ignore-overlaps which Is on noah at /hps/nobackup/research/zi/projects/tech_wars/analysis/baseline_variants/nanopore/pileups/south_africa/R20574.pileup.bcf this pileup is then used to call SNPs by bcftools call

I did just create a BAM for this too though if you want it /hps/nobackup/research/zi/projects/tech_wars/analysis/baseline_variants/tmp/R20574.bam

iqbal-lab commented 3 years ago

Does Martin have access to this repo btw? Actually it doesnt matter. Will take a look at this with him, either Monday afternoon or Tuesday first thing

mbhall88 commented 3 years ago

It's a public repo, so he should be able to see all of this.

mbhall88 commented 3 years ago

I think I have figured out what is happening here!

tl;dr there was a sample swap between R20574 and R28980

I actually realised there was something weird going on here when looking at the differences between the COMPASS and bcftools clusters. I was looking at the samples that had poor recall and precision cluster-wise between the two. It is best illustrated with a drawing

IMG_1827

In red boxes are the samples. The Illumina distance between 2 samples is represented with a blue line and number, and nanopore with black. I've put a red square around the noticeably discrepant distances.
So we see there is something funny going on between the trio of R20260, R28980, and R20574 - two of which are the samples in the title of this issue.

When I look at the positions with SNP differences between R20260 and R28980 on Illumina and compare those positions to the differences between R20260 and R20574 on nanopore they overlap in the following way:

The next thing I did was look at the pairwise differences between the Illumina and Nanopore consensus sequences for these three samples. The _b after the sample name indicates bcftools (nanopore) and the _c indicates COMPASS (Illumina)

R20260_c x R20574_c: 0
R20260_c x R28980_c: 34
R20260_c x R20260_b: 0
R20260_c x R20574_b: 28
R20260_c x R28980_b: 0
R20574_c x R28980_c: 34
R20574_c x R20260_b: 0
R20574_c x R20574_b: 28
R20574_c x R28980_b: 0
R28980_c x R20260_b: 33
R28980_c x R20574_b: 0
R28980_c x R28980_b: 32
R20260_b x R20574_b: 27
R20260_b x R28980_b: 0
R20574_b x R28980_b: 26

Admittedly this melted my brain for a little while. But there are four distances that explain the whole thing

R20574_c x R20574_b: 28
R28980_c x R28980_b: 32
R28980_c x R20574_b: 0
R20574_c x R28980_b: 0

Conclusion

I'm getting to the limits of how much my brain can handle in a single sitting - so correct me if this is wrong - but is it actually possible to know whether it was the Illumina or nanopore that got swapped? One thing that makes me slightly inclined to think it could be the nanopore that got swapped is that those two samples were multiplexed in the same run, with R28980 being barcode 14 and R20574 being barcode 17. It is plausible that maybe this got mixed up when it was entered in the spreadsheet I received or I guess maybe there was a mix up when adding the barcodes in the lab?

I'll forward this on to Anastasia and see if she can shed any light.

mbhall88 commented 3 years ago

Close pairwise distance dotplot before sample swap detection

bokeh_plot

And after swapping the Nanopore data for the two samples

bokeh_plot

We get a sizeable improvement in correlation coefficient!

Results from clustering will be added to #65

Note: this is not quite final as in order to speed up this analysis I ran bcftools mpileip with no per-base alignment quality (BAQ). We know from #38 that this results in slightly worse precision. Once the longer, more accurate bcftools mpileup with BAQ is finished I will update the plots. The couple of slight outliers left in the above plot all relate to this no-BAQ run

mbhall88 commented 3 years ago

The DST results confirm that this was indeed a Nanopore sample swap. If we hadn't swapped the labelling on the files, the mykrobe Illumina and Nanopore predictions would have been quite different, and the nanopore ones would have disagreed with the available DST for these samples