mcfrith / last-rna

Prepare a genome with or without repeat-masking #9

LiShuhang-gif closed this issue 3 years ago

LiShuhang-gif commented 3 years ago

Hi, I was trying to run tandem_genotypes to detect tandem repeats in my ONT data, but I have a question about preparing the genome. There are two options at this step: preparing the genome with or without repeat-masking. If I care more about accuracy than running time, should I prepare the genome without repeat-masking? Which option do you recommend? Thanks a lot.

mcfrith commented 3 years ago

For whole human genome sequencing, we usually do it "with" repeat masking. That has worked fine in several published papers. So that's what I'd recommend, really.

For best possible accuracy/sensitivity, it's better to do it without repeat masking. But that uses much more time and memory.

For a smaller genome (e.g. bacterial) I'd do it without masking.
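As a rough illustration of the two options, here is a minimal sketch that prints (rather than runs) a `lastdb` command for each. The flag combinations (`-uNEAR -R11 -c` for a genome whose repeats are already in lowercase, `-uNEAR -R01` for an unmasked genome) are recalled from the LAST long-read recipes of that period and are an assumption here, not something stated in this thread; `lastdb_cmd`, `mydb`, `genome.fa` and `genome-masked.fa` are hypothetical names. Please check the current `lastdb` documentation before running anything.

```shell
#!/bin/sh
# Dry-run helper: prints a lastdb command so it can be inspected first.
# Usage: lastdb_cmd MODE DBNAME FASTA   (MODE: masked | unmasked)
# All flags below are assumptions taken from older LAST recipes.
lastdb_cmd() {
    mode=$1; db=$2; fasta=$3
    case "$mode" in
        masked)
            # Assumes repeats in FASTA are already lowercase (soft-masked).
            # -R11: keep lowercase, also lowercase simple repeats.
            # -c: exclude lowercase letters from initial matches.
            echo "lastdb -P8 -uNEAR -R11 -c $db $fasta" ;;
        unmasked)
            # No -c, so lowercase letters are not excluded from initial matches.
            echo "lastdb -P8 -uNEAR -R01 $db $fasta" ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}

lastdb_cmd masked mydb genome-masked.fa
lastdb_cmd unmasked mydb genome.fa
```

Printing the command instead of executing it keeps the sketch harmless on a machine without LAST installed; paste the printed line into a terminal once the flags have been checked.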

AlisaGU commented 2 years ago

> For best possible accuracy/sensitivity, it's better to do it without repeat masking. But that uses much more time and memory.

I have a query genome of about 20 Gb, and its repeat annotation is still running. Can I do pairwise genome alignment with the unmasked genome? It seems workable, although it would take more time and memory.

mcfrith commented 2 years ago

Pairwise genome alignment is a bit different from aligning long reads (in the preceding comments).

The preceding comments are also a bit out of date. Now I might suggest -uRY4 instead of masking, see: https://www.biorxiv.org/content/10.1101/2022.05.30.494079v1

You can surely do unmasked pairwise genome alignment, if you use an option such as -uRY to reduce the run time and memory use. If you don't use such an option, it might or might not be feasible: it depends on how big the other genome is, how closely-related, and how repetitive.
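For the long-read case, the `-uRY4` suggestion might look something like the sketch below, which prints a three-step recipe (build the database, train alignment parameters, align) without running it. Only `-uRY4` comes from this thread; the `last-train` and `lastal` options (`-P8`, `-Q0`, `--split`, `-p`) are recalled from LAST's long-read recipes and are assumptions, as are the hypothetical names `ry_pipeline`, `mydb`, `genome.fa` and `reads.fastq`. The thread does not say which `-uRY` variant suits pairwise genome alignment, so the seed is left as a parameter.

```shell
#!/bin/sh
# Dry-run helper: prints a read-alignment recipe using an -uRY seed scheme.
# Usage: ry_pipeline SEED DBNAME GENOME READS   (SEED e.g. RY4)
# Flags other than -uRY4 are assumptions; check the LAST docs for your version.
ry_pipeline() {
    seed=$1; db=$2; genome=$3; reads=$4
    # Step 1: build the database with the chosen seed scheme, no masking step.
    echo "lastdb -P8 -u$seed $db $genome"
    # Step 2: estimate substitution and gap parameters from the reads.
    echo "last-train -P8 -Q0 $db $reads > $reads.par"
    # Step 3: align with the trained parameters, keeping split alignments.
    echo "lastal -P8 --split -p $reads.par $db $reads > $reads.maf"
}

ry_pipeline RY4 mydb genome.fa reads.fastq
```

Older LAST versions may need `lastal ... | last-split` in place of `lastal --split`.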

AlisaGU commented 2 years ago

Thanks, let me give it a try