wrong alignment view in IGV when there is a deletion in the SARS-CoV-2 samples

natinreg commented 3 years ago

I am following up on the issue previously posted by @iferres. The IGV images below show a "normal" IGV alignment for samples run with the ARTIC protocol (using the Connor lab implementation) and a problematic IGV alignment for samples run with poreCov (same result with the Medaka and the new Nanopolish pipeline). The affected samples are P.1 variants, and the problem starts at del11288-11296, a deletion that is a defining feature of P.1. While the IGV view is wrong (it does not capture the deletion and produces a run-through in the remaining part of the genome), the deletion is correctly called in the VCF files. The consensus sequences are also okay. So, we think it is something with some post-processing of the bam files in the poreCov pipeline.

Besides, as @hoelzer has mentioned, the region involves a primer border. However, as can be seen in the third attached image, the problem involves reads coming from both sets of primers. Since the consensus is okay, the good thing is that this issue will not produce misclassification of variants. But it is annoying not to be able to check the read alignments when needed.

Attached images: 3_poreCov_zoom, 1_artic_Connor, 2_poreCov

Could we share the fastq/bam files with you so you can try them?

natinreg commented 3 years ago

Sorry about the order of the images. The correct order is: middle (bams with artic), bottom (bams with poreCov, samples two and three with the deletion) and top (zoom for one of the samples where the deletion and the gaps should be seen).

replikation commented 3 years ago

Hi, thanks for replying. Indeed, it would be easier if I could get my hands on this fastq file and test some parameters with it to figure out why it is not calling the deletion.
Not sure how big the fastq file is, but you could upload it to e.g. https://osf.io/ ?
Alternatively, I could provide you with an upload link via Twitter (@gcloudchris). Thanks again for reporting the issue.

hoelzer commented 3 years ago

Hi @natinreg, thanks for reporting!

"The consensus sequences are also okay. So, we think it is something with some post-processing of the bam files in the poreCov pipeline."

So basically the consensus you get out of poreCov is correct, but when you load the poreCov-generated BAM into IGV you see this strange pattern? And you cannot see the deletion? Let me just make sure that this issue is not simply a mix-up of the BAMs and their mapping targets:

1) The ARTIC pipeline (and thus also poreCov) maps the reads to the Wuhan reference genome. After mapping, primers are clipped. The resulting BAMs are used for variant calling (*primertrimmed.rg.sorted.bam). These BAMs are not copied into the final results folder but can be found in the work dir.

2) After the consensus is built, poreCov aligns the reads again, this time to the consensus, to calculate coverage parameters and visualize the coverage. These BAMs can be found in the per-sample result folder, where the consensus FASTA is also located. Of course, deletions called relative to the Wuhan reference genome can no longer be seen in an alignment against the reconstructed consensus.
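One quick way to tell the two kinds of BAM apart is to look at the `@SQ` lines of the BAM header, which name the sequence the reads were aligned against. The snippet below is only a minimal sketch: it parses header text such as that printed by `samtools view -H`, and both example headers are hypothetical (the `barcode01` name and the length 29894 are assumptions for illustration, not actual poreCov output).

```python
def reference_sequences(sam_header):
    """Return {reference name: length} parsed from the @SQ lines of a SAM header."""
    refs = {}
    for line in sam_header.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        refs[fields["SN"]] = int(fields["LN"])
    return refs


# Hypothetical header of a BAM mapped to the Wuhan reference (MN908947.3, 29903 bp).
wuhan_bam_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:MN908947.3\tLN:29903\n"
# Hypothetical header of a BAM mapped to a per-sample consensus (assumed name and length).
consensus_bam_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:barcode01\tLN:29894\n"

print(reference_sequences(wuhan_bam_header))
print(reference_sequences(consensus_bam_header))
```

If the name or length reported here does not match the genome loaded in IGV, the BAM was aligned against a different sequence.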

However, your IGV screenshot "bottom (bams with poreCov, samples two and three with the deletion)" looks strange so something is going on here :)

natinreg commented 3 years ago

Dear @hoelzer, after reading your reply I now understand what is going on with the BAMs. I have always been interested in the primertrimmed.rg.sorted.bam files, but I never understood how they differed from the final BAM in the per-sample folder. It was clear that the IGV problem had something to do with the reference, but I checked one of the files (seq_ident_check.tsv) and it still named the Wuhan reference. Now the issue is very clear (in the final BAM the alignment is against the consensus!). Thank you very much for clarifying this, and I hope @replikation hasn't put much time into this.

PS: I had been wondering why the reference sequence name in each BAM was changed to the barcode number!! Thank you again.

hoelzer commented 3 years ago

@natinreg ah great, then it was just a mix-up of the BAMs! You're welcome. Now I also understand your "bottom (bams with poreCov, samples two and three with the deletion)" picture. The two bottom tracks are aligned against the consensus, but you somehow managed to visualize them against the Wuhan reference genome. Thus, from the deletion position on, all nucleotides are shifted by a few bases and all the substitutions pop up in the alignment view :) I am just wondering how you managed to load this into IGV ;)
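The shift can be illustrated with a small sketch. The helper `consensus_to_reference` and the toy sequences below are hypothetical and not part of poreCov; they only assume that the consensus equals the reference minus the deleted bases.

```python
def consensus_to_reference(pos, deletions):
    """Map a 1-based consensus position to the 1-based reference position.

    `deletions` holds (start, end) reference intervals, inclusive, that are
    missing from the consensus.
    """
    ref_pos = pos
    for start, end in sorted(deletions):
        if ref_pos >= start:
            ref_pos += end - start + 1
    return ref_pos


# The P.1 deletion discussed in this issue removes reference positions 11288-11296.
p1_deletion = [(11288, 11296)]
print(consensus_to_reference(11287, p1_deletion))  # upstream of the deletion: unchanged
print(consensus_to_reference(11288, p1_deletion))  # downstream: shifted by 9 bases

# Toy example: remove reference positions 5-7 to build a "consensus".
reference = "ACGTTTGACCATGCA"
consensus = reference[:4] + reference[7:]

# Overlaying consensus coordinates directly on the reference (what the IGV
# screenshot effectively shows) gives mismatches from the deletion onwards.
naive_mismatches = [
    i + 1 for i, (c, r) in enumerate(zip(consensus, reference)) if c != r
]
print(naive_mismatches)
```

With the coordinate mapping applied, every consensus base lines up with its reference base again, which is why the BAMs mapped to the Wuhan reference show the deletion cleanly.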

And besides, good that you also look into the mappings, in particular, for important VOC samples - we discovered some strange amplicon-based edge cases w/ Illumina data and I'm sure similar things can also occur w/ Nanopore data.

However, I think this issue can be closed.

replikation / poreCov

wrong alignment view in IGV when there is a deletion in the SARS-CoV-2 samples #97