s-andrews / SeqMonk

SeqMonk NGS visualisation and analysis tool
GNU General Public License v2.0

Building Custom Genome/Working with published genome #275

Open riyabelani opened 2 months ago

riyabelani commented 2 months ago

Hello, I am trying to analyze RRBS data from M. californianus, and there is a published scaffold genome that I used in Bismark to get BAM and BED files. I want to visualize these against the genome so I can see where on the genome methylation is occurring. I am following the YouTube tutorial for creating a custom genome. However, when the tutorial says to create pseudo-chromosomes, the number 25 is used. Most Mytilus species have 14 chromosomes, so I thought I should put 14, but when I do, it creates 19 anyway. What should I do? Also, if the genome does not have chromosomal annotations but does have a GFF file, will this work?

I am also doing this with a chromosomally annotated genome for M. trossulus, but this genome was not showing up in SeqMonk. Would I have to create a custom genome for that analysis as well? My goal is to be able to visualize on which chromosomes DNA methylation is occurring.

Thank you!

s-andrews commented 2 months ago

Pseudo-chromosomes aren't meant to represent real chromosomes, so the number you use doesn't have to relate to the species you're working with. The problem is that if you use a scaffold-based assembly, each scaffold becomes a chromosome, so much of the SeqMonk interface will be unusable (genome view, any filter based on chromosomes, etc.). Also, data caching is done at the level of a chromosome, so if you have thousands of chromosomes you'll have tens of thousands of cache files and everything will be super slow.

A pseudo-chromosome just groups scaffolds together into sensibly sized chunks. The scaffolds are still independent; it's just a display thing. We say around 25 as that's a good ratio for reducing your data into cached chunks.
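
To make the grouping idea concrete, here is a minimal Python sketch of one way scaffolds could be packed into similarly sized groups. It is not SeqMonk's actual implementation; the function name `group_scaffolds`, the `pseudoN` naming, and the greedy fill rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    scaffold: str           # original scaffold name
    pseudo_chromosome: str  # hypothetical group label, e.g. "pseudo1"
    offset: int             # 0-based start of the scaffold inside its group
    length: int             # scaffold length in bp


def group_scaffolds(scaffold_lengths, n_groups=25):
    """Greedily assign scaffolds, in input order, to groups of similar size.

    scaffold_lengths is a list of (name, length) tuples. The target group
    size is total length / n_groups, rounded up. A new group is started
    whenever adding the next scaffold would exceed that target.
    """
    total = sum(length for _, length in scaffold_lengths)
    target = -(-total // n_groups)  # ceiling division

    placements = []
    group_index = 1
    used = 0  # bases already placed in the current group
    for name, length in scaffold_lengths:
        if used > 0 and used + length > target:
            group_index += 1
            used = 0
        placements.append(Placement(name, f"pseudo{group_index}", used, length))
        used += length
    return placements


# Illustrative input only; these are not real scaffold names or lengths.
scaffolds = [("s1", 100), ("s2", 50), ("s3", 80), ("s4", 70)]
for p in group_scaffolds(scaffolds, n_groups=2):
    print(p.scaffold, p.pseudo_chromosome, p.offset)
```

Each scaffold keeps its own name, offset, and length, which is the sense in which scaffolds stay independent inside a group. With a simple greedy rule like this one, the number of groups actually produced can differ from the number requested, because scaffolds are never split. That is a property of this sketch only and says nothing about how SeqMonk arrives at its own count.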

When you do your analysis you're not using the pseudo-chromosomes. You'll use genes or scaffolds in your reporting, so this is just a technical implementation detail.
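
As a rough illustration of why the grouping does not affect reporting, the Python sketch below translates a position on a grouped sequence back to a scaffold and a scaffold-relative coordinate. SeqMonk handles this internally and you do not need to do it yourself; the `to_scaffold` function and the layout format are hypothetical.

```python
from bisect import bisect_right


def to_scaffold(layout, position):
    """Map a 0-based position on a pseudo-chromosome back to its scaffold.

    layout is a list of (offset, length, scaffold_name) tuples sorted by
    offset, describing where each scaffold sits inside the group.
    Returns (scaffold_name, position_within_scaffold).
    """
    starts = [offset for offset, _, _ in layout]
    i = bisect_right(starts, position) - 1  # last scaffold starting at or before position
    if i < 0:
        raise ValueError("position lies before the first scaffold")
    offset, length, name = layout[i]
    if position >= offset + length:
        raise ValueError("position lies outside every scaffold")
    return name, position - offset


# Illustrative layout only: two scaffolds placed back to back.
layout = [(0, 100, "s1"), (100, 50, "s2")]
print(to_scaffold(layout, 120))
```

Because every position still resolves to exactly one scaffold, results can be reported per gene or per scaffold regardless of how the scaffolds were grouped for display and caching.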