igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
634 stars 379 forks

question: increase visibility of SNPs at wider views? #1551

Open kevfengler227 opened 2 weeks ago

kevfengler227 commented 2 weeks ago

Is there a way to increase the visibility of SNPs at wider views? In the example below, I can see SNP differences between alignments in a 56 kb window, but not in a 140 kb window, which encompasses the entire region I want to display.

image

image

kevfengler227 commented 2 weeks ago

I should add that the coverage track has the desired SNP visibility, just not the alignments.

image

jrobinso commented 2 weeks ago

Probably not at the moment, but I will look into it. When zoomed in, all mismatches are shown, not just those deemed significant for the coverage track. My vague recollection is that we stop doing this at some resolution because it becomes too cluttered, but this should be revisited.

jrobinso commented 2 weeks ago

Just a note here -- this only seems to happen with long read (3rd gen) data.

kevfengler227 commented 2 weeks ago

Indeed. These are actually 140 kb genomic segments aligned as HiFi reads. But I am trying to show the SNP variation and haplotypes in each genome. This is one way to turn IGV into a pangenome viewer!

jrobinso commented 2 weeks ago

Interesting. I'll have this fixed soon. If the dataset you are using or creating is public let me know, it would be an interesting test case to add.

jrobinso commented 2 weeks ago

One issue that arises as you zoom out is that many bases land on the same pixel. At 100 kb there are approximately 100 bases per pixel. For typical reads this means that nearly every pixel of the alignment will have a mismatch, often multiple mismatches. We might need some user options for how to handle this.
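To put rough numbers on this, here is a minimal sketch. It assumes a 1,000-pixel-wide track (which is what the ~100 bases per pixel figure at 100 kb implies) and a hypothetical per-base mismatch rate; neither value is taken from IGV itself, and the function names are illustrative only.

```python
def bases_per_pixel(window_bp: int, track_width_px: int) -> float:
    """Genomic resolution: how many bases share one screen pixel."""
    return window_bp / track_width_px


def expected_mismatches_per_pixel(window_bp: int, track_width_px: int,
                                  mismatch_rate: float) -> float:
    """Expected number of mismatches landing on one pixel of one alignment."""
    return bases_per_pixel(window_bp, track_width_px) * mismatch_rate


# Assumed values, for illustration only.
TRACK_WIDTH_PX = 1000
ASSUMED_MISMATCH_RATE = 0.01  # hypothetical: 1 mismatch per 100 bases

for window_bp in (56_000, 100_000, 140_000):
    bpp = bases_per_pixel(window_bp, TRACK_WIDTH_PX)
    exp = expected_mismatches_per_pixel(window_bp, TRACK_WIDTH_PX,
                                        ASSUMED_MISMATCH_RATE)
    print(f"{window_bp:>7} bp window: {bpp:.0f} bp/pixel, "
          f"~{exp:.2f} mismatches/pixel")
```

With a much lower mismatch rate, as in near-identical genome segments, the expected count per pixel drops well below one, which is why the same view can stay readable for that kind of data.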

jrobinso commented 2 weeks ago

As an illustration, here is a 247 kb window of PacBio alignments with mismatches drawn. It's not usable, and rendering is extremely slow. So some preferences or a special mode are needed here.

Screenshot 2024-08-28 at 11 01 21 PM

kevfengler227 commented 2 weeks ago

Yes, I did not intend to use this capability for PacBio or ONT reads, but rather for genomes with relatively few differences. So a "genome" mode would be ideal. I can send you an example public dataset.

Of course, the user needs to do some upfront work to create the ideal input data, but reducing each genome to 1x, rather than 30x PacBio, is extremely powerful and is the only practical way to view a large pangenome.

jrobinso commented 2 weeks ago

A public dataset would be helpful.

jrobinso commented 2 weeks ago

There will still be limits on zooming out because, at a minimum, the sequence for the entire region needs to be loaded, not to mention the read sequence in every alignment. We could not view an entire chromosome with read sequences, for example.

kevfengler227 commented 2 weeks ago

Admittedly, this will probably only work well for low-diversity applications like my initial request. In that case there are only ~13 SNPs in a handful of genomes in a 140 kb range, which was just beyond the visibility limit, so I was hoping for a way to crank up the SNP visibility. But that wouldn't make sense if there were a ton of variation, which is often the case for plant pangenomes.

It seems that 114 kb is the visibility max for SNPs, but INDELs are visible at much wider ranges.

image

image

This real-world example from the maize pangenome probably has too many SNPs to display nicely at wider ranges, but in some specific cases it would still be useful.

kevfengler227 commented 2 weeks ago

Here, genomes were aligned in consecutive 100 kb chunks.

kevfengler227 commented 2 weeks ago

Here is a test dataset of mock data, with a few SNPs over 245 kb.

image

10genomes.fasta.gz

test.fasta.gz

kevfengler227 commented 2 weeks ago

minimap2 -ax map-hifi -t4 test.fasta 10genomes.fasta | samtools view -b -1 - | samtools sort --write-index -o 10genomes.bam

kevfengler227 commented 2 weeks ago

So basically I am trying to use IGV as a haplotype viewer.

jrobinso commented 2 weeks ago

I've never used minimap2, but that's OK. I think the simplest resolution of this issue would be to just make the max window for showing mismatches user settable, probably as a preference. A new display mode is a bigger topic that deserves its own issue; it would be longer term and prioritized against other bigger topics.

I will also make SNP display subject to the limit. BTW, currently the limit is not on the genomic window, which can vary by display size, but on the resolution in bp/pixel.
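The difference between a window limit and a resolution limit can be sketched as follows. This is not IGV's implementation: the threshold value and the function names are hypothetical, chosen only to show that one bp/pixel cutoff maps to different genomic windows on different display widths.

```python
def show_mismatches(window_bp: int, track_width_px: int,
                    max_bp_per_pixel: float) -> bool:
    """Draw mismatches only while the resolution stays at or below the cutoff."""
    return window_bp / track_width_px <= max_bp_per_pixel


def max_window_bp(track_width_px: int, max_bp_per_pixel: float) -> int:
    """Widest genomic window that still shows mismatches on this display."""
    return int(track_width_px * max_bp_per_pixel)


# Hypothetical cutoff; the real value in IGV is not stated in this thread.
ASSUMED_MAX_BP_PER_PIXEL = 100.0

for width_px in (1000, 1400, 2000):
    limit = max_window_bp(width_px, ASSUMED_MAX_BP_PER_PIXEL)
    visible = show_mismatches(140_000, width_px, ASSUMED_MAX_BP_PER_PIXEL)
    print(f"{width_px} px wide: mismatches up to {limit} bp; "
          f"140 kb window shows mismatches: {visible}")
```

Under these assumptions, the same 140 kb region hides mismatches on a narrow display and shows them on a wider one, which is why a user-settable preference expressed in bp/pixel behaves differently from one expressed as a window size.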

kevfengler227 commented 2 weeks ago

sounds great. thanks!

baozg commented 1 week ago

Related question: when loading IGV with more than 100 genomes (whole-genome alignment by minimap2 -x asm20), the speed is very slow. Is there any way to speed it up?

kevfengler227 commented 1 week ago

Rather than performing whole-genome alignments, I typically chunk each genome into consecutive 10 kb segments and align those, which is faster for alignment, and the alignments can be toggled by mapping quality and alignment score. If you add the genome name to the read group when running minimap2 and merge the resulting BAM files, 100 genomes is essentially the same as 100x Illumina coverage and is quite rapid to view in IGV.
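As a rough sketch of the read-group step, the helper below builds one minimap2 command per genome. The file names and the helper itself are hypothetical; the `-ax map-hifi` and `-t` options come from the command earlier in this thread, and `-R` is minimap2's option for supplying a SAM read group line.

```python
import shlex


def minimap2_command(reference: str, chunks_fasta: str, genome_name: str,
                     threads: int = 4) -> list[str]:
    """Build a minimap2 command that tags every alignment with a read group.

    The read group line uses literal backslash-t separators, in the form
    shown in minimap2's help text for -R.
    """
    read_group = f"@RG\\tID:{genome_name}\\tSM:{genome_name}"
    return ["minimap2", "-ax", "map-hifi", f"-t{threads}",
            "-R", read_group, reference, chunks_fasta]


# Hypothetical genome names and chunk files, for illustration only.
for name in ("genomeA", "genomeB"):
    cmd = minimap2_command("test.fasta", f"{name}.chunks.fasta", name)
    print(shlex.join(cmd))
```

Each command's output would then be sorted and the per-genome BAM files merged, so that IGV can color and group the alignments by read group.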

image

kevfengler227 commented 1 week ago

If you zoom out you can see the PAV in the genomes well, just not the SNPs

image

kevfengler227 commented 1 week ago

coloring and grouping by read group is key

kevfengler227 commented 1 week ago

Finally, if you number the chunks consecutively, you know exactly where each one came from in the query, which is much better than using k-mers or other methods where coordinates are lost. Then you know you are looking at syntenic alignments when you mouse over a chunk and see that its chunk number (position) is similar to the reference.
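A minimal sketch of consecutive chunk numbering, assuming a naming scheme of `<genome>_chunk<NNNNNN>` (the scheme and function names are made up for illustration, not the one used to produce the screenshots above):

```python
from typing import Iterator


def chunk_sequence(name: str, sequence: str,
                   chunk_size: int) -> Iterator[tuple[str, str]]:
    """Yield (chunk_name, chunk_sequence) pairs, numbered from 1 in query order."""
    starts = range(0, len(sequence), chunk_size)
    for index, start in enumerate(starts, start=1):
        yield f"{name}_chunk{index:06d}", sequence[start:start + chunk_size]


def chunk_start(chunk_name: str, chunk_size: int) -> int:
    """Recover the 0-based query start coordinate from a chunk name."""
    index = int(chunk_name.rsplit("_chunk", 1)[1])
    return (index - 1) * chunk_size


# Toy 25-base sequence split into 10-base chunks; the last chunk is shorter.
for chunk_name, chunk in chunk_sequence("genomeA", "ACGTA" * 5, 10):
    print(f">{chunk_name} start={chunk_start(chunk_name, 10)}")
    print(chunk)
```

Because the chunk number encodes the query position, hovering over an alignment in IGV and reading its name is enough to tell whether it sits at a similar position to the reference.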

baozg commented 1 week ago

Thanks for sharing! Chunking could be a good idea, but it also loses the ability to detect variation longer than the chunk length and can introduce ambiguous alignments (TEs). It is more like doing the chaining yourself, since you know the coordinates. I think it would be better if IGV used chunks in the browser but with more contiguous alignments. Actually, I use AnchorWave and wfmash more often, which produce nearly end-to-end alignments in A. thaliana (easier than maize). As an alternative approach to IGV, I use https://github.com/cmdcolin/jbrowse-plugin-mafviewer, converting my PAF to a pseudo-MAF (which can only present SNPs or DELs).

image

kevfengler227 commented 1 week ago

You can use whatever chunk size you want for a given application depending on the level of similarity in the pangenome, typically 1-100 kb (aligned with map-hifi). With that you can see quite large INDELs. Again, you can control what is displayed by changing the visualization parameters in IGV, more so than with whole-genome alignments. Also, the directionality of chunks is indicative of inversions. For major differences, the lack of an aligned chunk is also informative.

So the real beauty of the chunked alignment approach is that it is highly parallelizable and rapid. One can do an all-by-all comparison in minutes, so that any or all references can be viewed in IGV with all queries on a whim. If you want to get fancy, you can group your queries into various sub-groups rather than one big one.