What is the difference of effect between repeat-masking and without repeat-masking on calling tandem repeats?

mcfrith / last-rna

MIT License


What is the difference of effect between repeat-masking and without repeat-masking on calling tandem repeats? #11

Open Jesson-mark opened 3 years ago

Jesson-mark commented 3 years ago

Hi, I'm using tandem-genotypes to find tandem repeats (TRs) in our human PacBio HiFi reads. I found a real TR that tandem-genotypes does not call successfully. I built the index with lastdb from a repeat-masked genome. The parameters for lastdb, last-train and lastal are the same as those suggested in this recipe.

I wonder whether the repeat-masked genome harms the alignment, so that many reads get a high mismap score. Could you give any suggestions for our problem? Also, how does calling tandem repeats differ with and without repeat-masking?

Thanks.

mcfrith commented 3 years ago

We usually use tandem-genotypes with repeat-masking, and it usually works fine. Repeat-masking means that it excludes repeats when finding potential matches between reads and genome. After that, it finalizes the alignments between reads and genome: at this stage the masking is not applied, so the alignments should extend into the repeats just fine.
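The two stages described above can be illustrated with a toy sketch. This is not LAST's actual algorithm: it assumes soft-masking (masked bases in lowercase), exact k-mer seeds, and ungapped extension, and the functions `find_seeds` and `extend_right` and the sequences are hypothetical. It only shows that seeds are taken from unmasked sequence while extension ignores the mask.

```python
def find_seeds(read, genome, k=4):
    """Return (read_pos, genome_pos) pairs for exact k-mer matches.

    Only genome k-mers that are fully uppercase (unmasked) are used,
    mimicking "masking is applied when finding potential matches".
    """
    seeds = []
    for g in range(len(genome) - k + 1):
        kmer = genome[g:g + k]
        if not kmer.isupper():  # contains a masked (lowercase) base: skip
            continue
        r = read.find(kmer)
        while r != -1:
            seeds.append((r, g))
            r = read.find(kmer, r + 1)
    return seeds


def extend_right(read, genome, r, g):
    """Ungapped rightward extension that ignores case (mask not applied)."""
    n = 0
    while (r + n < len(read) and g + n < len(genome)
           and read[r + n].upper() == genome[g + n].upper()):
        n += 1
    return n


# Hypothetical genome: unique flanks in uppercase, a masked CAG repeat in lowercase.
genome = "ACGTTGCA" + "cagcagcagcag" + "TTGACCGT"
read = "GTTGCA" + "CAGCAGCAGCAG" + "TTGACC"

seeds = find_seeds(read, genome)
first_read_pos, first_genome_pos = seeds[0]
length = extend_right(read, genome, first_read_pos, first_genome_pos)
print("seed genome positions:", sorted({g for _, g in seeds}))
print("extension from first seed covers", length, "of", len(read), "read bases")
```

In this toy case every seed lies in a flank, yet the extension from the first seed runs through the masked repeat and covers the whole read.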

It's hard to say what's happening in your case: it may have nothing to do with repeat masking. Try visualizing the alignments around your TR of interest. (A typical problem is a TR which is longer than the reads: we can't handle that.)

I'm also not sure what you mean by "called": tandem-genotypes takes a tandem-repeat annotation file as input, and it can only analyze TRs that are in that file.

Jesson-mark commented 3 years ago

Thanks for your prompt reply. I will try your suggestions.

By "called" I mean that tandem-genotypes can find (or analyze) a TR that is in the tandem-repeat annotation file. I used simpleRepeat.txt as the annotation file, and it contains 1031708 TRs. The result file (tg.txt) from tandem-genotypes has 688415 TRs, which means nearly 1/3 of the TRs are not analyzed. Is it because those TRs are longer than the reads?
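For reference, the "nearly 1/3" figure can be checked directly from the two counts quoted above. This is plain arithmetic on those numbers, with no other assumptions:

```python
annotated = 1031708  # TRs in simpleRepeat.txt, as reported above
analyzed = 688415    # TRs in tg.txt, as reported above

missing = annotated - analyzed
fraction = missing / annotated
print(f"{missing} TRs not analyzed ({fraction:.1%} of the annotation)")
```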

mcfrith commented 3 years ago

Not sure, but here are a couple of relevant tandem-genotypes options:

-u BP, --min-unit=BP: ignore repeats with unit shorter than BP (default=2).

-vv: show output for all repeats, including ones not covered by any DNA read.
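One way to see how much of the gap the minimum-unit filter might explain is to compare the annotation with the output file. The sketch below rests on assumptions that should be checked against the real files: the annotation is taken to follow the UCSC simpleRepeat table layout (bin, chrom, chromStart, chromEnd, name, period, ...), the output is assumed to start with chrom, start, end columns with `#` comment lines, and the output coordinates are assumed to match the annotation. The helper names and the sample rows are made up for illustration.

```python
import io


def read_annotation(lines):
    """Yield (chrom, start, end, period) from simpleRepeat-style rows (assumed layout)."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        yield fields[1], int(fields[2]), int(fields[3]), int(fields[5])


def read_analyzed(lines):
    """Return the set of (chrom, start, end) found in the output file (assumed layout)."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[0], int(fields[1]), int(fields[2])))
    return keys


def summarize_missing(annotation, analyzed, min_unit=2):
    """Count annotated TRs absent from the output, and how many have unit < min_unit."""
    missing = [tr for tr in annotation if tr[:3] not in analyzed]
    short_unit = [tr for tr in missing if tr[3] < min_unit]
    return len(missing), len(short_unit)


# Made-up sample rows standing in for simpleRepeat.txt and tg.txt.
annotation_text = (
    "585\tchr1\t10000\t10468\ttrf\t6\t77.2\n"
    "585\tchr1\t20000\t20030\ttrf\t1\t30.0\n"
    "585\tchr1\t30000\t30900\ttrf\t3\t300.0\n"
)
output_text = (
    "# made-up output\n"
    "chr1\t10000\t10468\tTTAGGG\t.\t.\t0,1\t-1,0\n"
)

annotation = list(read_annotation(io.StringIO(annotation_text)))
analyzed = read_analyzed(io.StringIO(output_text))
n_missing, n_short = summarize_missing(annotation, analyzed)
print(f"{n_missing} TRs missing from output; {n_short} have unit shorter than 2")
```

With real files, replace the `io.StringIO` objects with open file handles. Missing TRs whose unit is at least the minimum are the ones worth inspecting further, for example for lack of read coverage.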

Jesson-mark commented 3 years ago

Thanks for your considerate suggestions. I'll have a try.