hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License

GSAlign and repeat masking #6

Open brettChapman opened 4 years ago

brettChapman commented 4 years ago

Hi

I have a question regarding how GSAlign handles repeats, both hard-masked and soft-masked, and how it infers variants when generating the VCF file in regions where there are repeats. Does it skip over all masked regions, whether soft-masked or hard-masked, or only those that are hard-masked?

Thanks.

hsinnan75 commented 4 years ago

Hi Brett,

Thank you for your interest in GSAlign. GSAlign does not mask repetitive sequences, so some of the variants may be called from repetitive regions. This should be addressed, and I'll look into solutions for this issue.

brettChapman commented 4 years ago

Hi

I probably wasn't very clear. I know GSAlign doesn't mask repeats. I'm using RepeatMasker to mask repeats, and was wondering whether GSAlign would recognise soft-masked repeats in lower-case "acgt" as distinct from non-masked "ACGT", and treat those regions differently during the alignment, either skipping them or only looking for unique alignments.
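To make the soft-masking convention concrete, here is a minimal sketch that reports how much of each sequence in a FASTA file is lower-case (soft-masked). It is only an illustration and not part of GSAlign or RepeatMasker; the function name `soft_masked_fraction` is hypothetical, and it assumes a plain-text FASTA in which soft-masked bases are written in lower case.

```python
def soft_masked_fraction(lines):
    """Return {sequence_name: fraction of lower-case (soft-masked) bases}.

    `lines` is any iterable of FASTA lines, e.g. an open file handle.
    """
    counts = {}  # name -> [lower_case_bases, total_bases]
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = [0, 0]
        elif name is not None:
            counts[name][0] += sum(1 for ch in line if ch.islower())
            counts[name][1] += len(line)
    return {
        n: (low / total if total else 0.0)
        for n, (low, total) in counts.items()
    }
```

For example, passing an open file handle for a soft-masked genome (such as a hypothetical `genome.softmasked.fa`) returns one fraction per chromosome or scaffold.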

Usually when calling variants I would hard-mask repeat regions with "N" and then align raw genomic reads to the genome. I'd then apply some read depth cut-offs and then use variant calling tools such as GATK for the variant calls.

Since GSAlign aligns two whole genomes or chromosomes, I don't think repeats would be as much of an issue as when aligning short genomic reads, since short reads could potentially misalign due to any reasonably lengthy repeats within those genomic regions/reads.

I've previously consulted the authors of minimap2 and VG, and both recommend not repeat masking for whole genome alignment using their tools. I was wondering what your recommendation would be, depending on how GSAlign operates. If it aligns large regions at a time, then repeats would not be an issue. However, if it breaks up the regions into smaller chunks for alignment, to improve performance and run time, then repeats may cause misalignment and consequently erroneous variant calls.

Thanks.

hsinnan75 commented 4 years ago

GSAlign does not differentiate between lower- and upper-case letters in the genome sequences. It just compares two genomes directly and finds the best local alignments. I'd also recommend not doing repeat masking for genome alignment. If it really bothers you, you could convert all lower-case bases into 'N's, and the aligner would then ignore all of those regions.
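The lower-case-to-'N' conversion could be sketched roughly as follows. This is a minimal illustration rather than anything shipped with GSAlign; the function names `hard_mask_line` and `hard_mask_fasta` are hypothetical, and it assumes a plain-text FASTA in which soft-masked bases are lower-case.

```python
def hard_mask_line(line: str) -> str:
    """Replace every lower-case character with 'N'.

    FASTA header lines (starting with '>') are returned unchanged so that
    sequence names are not altered.
    """
    if line.startswith(">"):
        return line
    return "".join("N" if ch.islower() else ch for ch in line)


def hard_mask_fasta(src, dst) -> None:
    """Copy FASTA lines from `src` to `dst`, hard-masking soft-masked bases.

    `src` is an iterable of lines (e.g. an open file); `dst` needs a
    `write` method (e.g. a file opened for writing).
    """
    for line in src:
        dst.write(hard_mask_line(line))
```

With hypothetical file names, usage would look like `with open("genome.softmasked.fa") as src, open("genome.hardmasked.fa", "w") as dst: hard_mask_fasta(src, dst)`. Processing line by line keeps memory use low even for large genomes.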

brettChapman commented 4 years ago

Thanks. I've decided not to mask. I've consulted the authors of VG, minimap2, cactus, and edyeet as well. All agree that it's best to avoid masking. The exception is cactus, which requires soft-masking during its alignment to identify regions for anchoring, but ultimately it too aligns through repeat regions in later stages of the alignment process.