rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

[Bug Report]: A small bug in the *.out file, which wrongly combines two close alignments from *.align #1

Open BioWu opened 6 years ago

BioWu commented 6 years ago

Details:

234 25.70 8.70 2.27 FBgn0040071 32795 33001 (1678) Jockey-3_DGri#LINE/Jockey 3207 3426 (1287) m_b557s001i5 6658

  FBgn0040071        32795 CCAACACCAGCAGCAACAGCAGCCTTCGCAGCCGCA----TAGC--TACC 32838
                              i  v        i       v iv     v   ----i   --i v 
  Jockey-3_DGri       3207 CCAGCAGCAGCAGCAGCAGCAGCATCAGCAGCAGCAGCATCAGCAGCAGC 3256

  FBgn0040071        32839 ACGAGCGTCTGGA-CAGCCAACCCGC--CTACAGGGGC-GCGGCAGCGGG 32884
                            vv   i  v v -    -- v v  -- vi   vi  -  i     i v
  Jockey-3_DGri       3257 AGCAGCATCAGCAGCAGC--AGCGGCAGCAGCAGCAGCAGCAGCAGCAGC 3304

  FBgn0040071        32885 TGCGGGCAGCTTCGCCACGCAGCCG-AGCAACTGT-GATACATCAGCGGG 32932
                           v    -    v  iv  -     v -    i v  -i v   v    i v
  Jockey-3_DGri       3305 AGCGG-CAGCATCAGCA-GCAGCAGCAGCAGCAGTCAAAACAGCAGCAGC 3352

  FBgn0040071        32933 TGCA--AGTAGCAGCAATACCAGCGGCAACAGCAACAACTCCT----CAG 32976
                           v   --  i    i   i v    i         i  i vv  ----   
  Jockey-3_DGri       3353 AGCAGCAGCAGCAACAACAGCAGCAGCAACAGCAGCAGCAGCTAAAACAG 3402

  FBgn0040071        32977 CGACGGCGGCGAGCAGCAACAGTAG 33001
                            ii i  i  -       i   i  
  Jockey-3_DGri       3403 CAGCAGCAGC-AGCAGCAGCAGCAG 3426

Matrix = 20p43g.matrix
Kimura (with divCpGMod) = 31.84
Transitions / transversions = 1.00 (26/26)
Gap_init rate = 0.07 (14 / 206), avg. gap size = 1.64 (23 / 14)

258 19.28 1.02 10.00 FBgn0040071 32903 33000 (1679) TART_DV#LINE/Jockey 2774 2863 (12236) m_b557s001i6 6658

  FBgn0040071        32903 GCAGCCGAGCAACTGTGATACATCAGCGGGTGCAAGTAGCAGCAATACCA 32952
                                --      v ii --  v    i vv    -i    i   i v  
  TART_DV#LINE/       2774 GCAGC--AGCAACAGCAA--CAACAGCAGCAGCAA-CAGCAACAACAGCA 2818

  FBgn0040071        32953 GCGGCAACAGCAACAACTCC-TCAGCGACGGCGGCGAGCAGCAACAGTA 33000
                             i              v  -v    i  i  ----           i 
  TART_DV#LINE/       2819 GCAGCAACAGCAACAACACCAGCAGCAACAGC----AGCAGCAACAGCA 2863

And the corresponding line in the *.out file:

258 22.7 5.1  5.9 FBgn0040071   32795 33001    (1678)        + Jockey-3_DGri LINE/Jockey     2774   2863    (12236)

As this example illustrates, the TE reference and the positions of these two close alignment records were wrongly combined. I found this bug in both v4.0.6 and the current version. I tried to fix it myself, but I could not find the relevant functions, as RepeatMasker is a huge project with plenty of scripts.

Hope you could fix it.

Thanks!
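
The mismatch reported above can be checked mechanically. What follows is a minimal Python sketch, not part of RepeatMasker: the function names are hypothetical, the field positions are assumed from the lines quoted in this report, and only forward-strand (`+`) *.out lines are handled. It looks for an alignment header whose repeat name and repeat coordinates both agree with the *.out line.

```python
# Header lines and the *.out line are copied verbatim from this report.
ALIGN_HEADERS = [
    "234 25.70 8.70 2.27 FBgn0040071 32795 33001 (1678) "
    "Jockey-3_DGri#LINE/Jockey 3207 3426 (1287) m_b557s001i5 6658",
    "258 19.28 1.02 10.00 FBgn0040071 32903 33000 (1679) "
    "TART_DV#LINE/Jockey 2774 2863 (12236) m_b557s001i6 6658",
]
OUT_LINE = ("258 22.7 5.1  5.9 FBgn0040071   32795 33001    (1678)        "
            "+ Jockey-3_DGri LINE/Jockey     2774   2863    (12236)")


def parse_align_header(line):
    """Parse a forward-strand *.align header (field layout assumed from the report)."""
    f = line.split()
    name, _, cls = f[8].partition("#")
    return {"repeat": name, "class": cls,
            "r_start": int(f[9]), "r_end": int(f[10])}


def parse_out_line(line):
    """Parse a forward-strand *.out line (field layout assumed from the report)."""
    f = line.split()
    return {"repeat": f[9], "class": f[10],
            "r_start": int(f[11]), "r_end": int(f[12])}


out_rec = parse_out_line(OUT_LINE)
align_recs = [parse_align_header(h) for h in ALIGN_HEADERS]

# Alignments that share the repeat name with the *.out line.
name_matches = [a["repeat"] for a in align_recs
                if a["repeat"] == out_rec["repeat"]]
# Alignments that share the repeat coordinates with the *.out line.
coord_matches = [a["repeat"] for a in align_recs
                 if (a["r_start"], a["r_end"]) == (out_rec["r_start"], out_rec["r_end"])]
# Alignments that agree on both name and coordinates.
consistent = [a["repeat"] for a in align_recs
              if a["repeat"] == out_rec["repeat"]
              and (a["r_start"], a["r_end"]) == (out_rec["r_start"], out_rec["r_end"])]

print("name from:", name_matches)
print("coordinates from:", coord_matches)
print("fully consistent:", consistent)
```

For the lines in this report, the name comes from the Jockey-3_DGri alignment, the coordinates come from the TART_DV alignment, and no single alignment agrees on both.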

rmhubley commented 6 years ago

Hi there, thanks for the report. Unfortunately, this is a database problem and not a RepeatMasker problem. RepeatMasker is designed to merge significantly overlapping fragments of two repeats from the same Class and Subfamily to resolve noisy alignment artifacts. The correct call is highly dependent on the quality of the species database. In this example, neither annotation appears to be correct. The high divergence, and the likelihood that this falls within the conserved reverse-transcriptase region of the LINE, suggest that this is a bit of an uncharacterized lineage-specific subfamily of the Jockey element. Is the genome you are analyzing perhaps not D_virilis or D_grimshawi (the species from which these two elements are derived)?

-R
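
To make "significantly overlapping" concrete for the two hits in this report, here is a small illustrative Python calculation. It is only a sketch: the function name is hypothetical, and RepeatMasker's actual merge criteria and thresholds are not stated in this thread, so nothing here should be read as its implementation. It measures how much of the shorter query interval is covered by the other.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of the shorter interval covered by the overlap.

    Coordinates are 1-based and inclusive, as in the *.align headers.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    shorter = min(a_end - a_start + 1, b_end - b_start + 1)
    return overlap / shorter


# Query coordinates of the two hits on FBgn0040071, taken from the report.
jockey_hit = (32795, 33001)   # Jockey-3_DGri#LINE/Jockey
tart_hit = (32903, 33000)     # TART_DV#LINE/Jockey

frac = overlap_fraction(*jockey_hit, *tart_hit)
print(f"overlap fraction of the shorter hit: {frac:.2f}")
```

The TART_DV hit lies entirely inside the Jockey-3_DGri hit on the query, and both are classified as LINE/Jockey, which is the situation the comment above describes.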

BioWu commented 6 years ago

Yes, I ran it on D_melanogaster. But the question is that the TE reference and the aligned region (in the TE) reported in the *.out file were not properly paired. E.g., the *.out line reports Jockey-3_DGri LINE/Jockey 2774 2863 (12236), but according to the two alignment records the aligned Jockey-3_DGri fragment was 3207 3426 (1287). This confuses me. Besides, you said both of these alignments were wrong, and I don't think I understood that. Did you mean that these two alignments (TE fragment annotations) were not right? I checked and found that the aligned regions in TART_DV and Jockey-3_DGri consist of many poly-CAG/CAA repeats; did you mean that these tandem repeats might lead to wrong alignments? Thanks

rmhubley commented 6 years ago

Oh yes, that does appear strange. Do you have the *.cat file from this run? I could use that to track down why it used the TART_DV coordinates for the Jockey-3_DGri annotation. Email me at rhubley@systemsbiology.org.

These are both purported to be lineage-specific versions of a Jockey in _virilis or _grimshawi. Since they overlap significantly and they are low scoring (but not low enough to be considered a false positive), it leads me to suspect that this is a bit of a Jockey subfamily that doesn't have a good representation (consensus sequence) in the repeat libraries yet. And of course, as you point out, there is always the chance that this tiny hit is a higher-scoring false positive that got through.

-R