malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Advanced diplotype clustering orders genes in the CNV section by label rather than by genomic position #558

Open alimanfoo opened 2 weeks ago

alimanfoo commented 2 weeks ago

If the cnv_region contains multiple genes, the CNV heatmap rows are ordered by gene label rather than by genomic position. This makes it harder to understand the structure of the CNVs. It would be better to order the rows by genomic position.
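
To illustrate the difference between the two orderings, here is a minimal sketch. The gene labels and start positions are made up for illustration and are not taken from the Af1 annotation; the point is only that sorting by label and sorting by start coordinate can disagree.

```python
# Hypothetical gene records as (label, start) pairs.
# Labels and coordinates are invented for illustration only.
genes = [
    ("gene_c", 1_000),
    ("gene_a", 5_000),
    ("gene_b", 3_000),
]

# Ordering by label, which is the behaviour described in this issue.
by_label = [label for label, _ in sorted(genes, key=lambda g: g[0])]

# Ordering by start coordinate, which is the behaviour being requested.
by_position = [label for label, _ in sorted(genes, key=lambda g: g[1])]

print(by_label)
print(by_position)
```

With these invented values the two lists come out in opposite orders, which is the kind of mismatch that makes the heatmap rows hard to read against the genome.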

alimanfoo commented 2 weeks ago

Here's an example:

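import malariagen_data

# Set up access to the Anopheles funestus (Af1) data resource.
af1 = malariagen_data.Af1()

# The CNV region spans several genes, so the CNV heatmap has multiple rows.
af1.plot_diplotype_clustering_advanced(
    region='X:8,438,477-8,460,887',
    snp_transcript='LOC125764232_t1',
    cnv_region='X:8,418,477-8,480,887',
    sample_sets=['1232-VO-KE-OCHOMO-VMF00044', '1231-VO-MULTI-WONDJI-VMF00043', '1236-VO-TZ-OKUMU-VMF00090'],
    sample_query="country in ['Kenya', 'Uganda', 'Tanzania'] and taxon == 'funestus'",
)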

image

sanjaynagi commented 2 weeks ago

Interesting, didn't notice this!

sanjaynagi commented 2 weeks ago

Doesn't seem to be the case in gambiae? Is there something odd about that funestus locus?

image

sanjaynagi commented 2 weeks ago

Just had a look at your example in the Af1 GFF. The genes are already in order of genomic position, although LOC125764275 (the middle gene) is on the reverse strand.

image

So for some reason CNVs at LOC125764275 are not getting called.

alimanfoo commented 2 weeks ago

> Just had a look at your example in the Af1 GFF, they are already in the order of genomic position, although LOC125764275 (middle gene) is on reverse strand.

No, I don't think so. Here are the three genes in the region I wanted to show CNV data for:

image

The middle gene should be LOC125764232 but it's not.

Actually, maybe the problem is that the GFF isn't sorted...

alimanfoo commented 1 week ago

The suggested fix is to sort the GFF when it is loaded within the genome_features() function.
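
As a rough sketch of what sorting at load time could look like, assuming the loaded GFF is held in a pandas DataFrame with `contig`, `start` and `end` columns. The helper name `sort_features` and the toy data are hypothetical, and this is not the actual implementation inside genome_features():

```python
import pandas as pd


def sort_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return features ordered by contig, then start, then end.

    Hypothetical helper: assumes `contig`, `start` and `end` columns exist.
    """
    return df.sort_values(["contig", "start", "end"]).reset_index(drop=True)


# Toy, made-up annotation rows deliberately given out of positional order.
unsorted_df = pd.DataFrame(
    {
        "contig": ["X", "X", "X"],
        "start": [3_000, 1_000, 5_000],
        "end": [3_500, 1_800, 5_900],
        "ID": ["gene_b", "gene_c", "gene_a"],
    }
)

sorted_df = sort_features(unsorted_df)
print(sorted_df[["contig", "start", "end", "ID"]])
```

Sorting once when the GFF is loaded would mean every downstream plot sees features in positional order without needing its own sort. Note that contigs are compared as strings here, so the order between contigs is alphabetical; only the within-contig order matters for this issue.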