rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

How to filter the short TE fragments from results file #232

Open aijigekoukou-shen opened 8 months ago

aijigekoukou-shen commented 8 months ago

Do I need to filter out TE fragments from the out file that are less than 100 bp in length? Or can I filter based on other criteria?

Here is my genome.fa.out file.

 score   div. del. ins.  sequence    begin    end          (left)   repeat              class/family        begin   end    (left)       ID

   228   10.7  0.0  8.5  Chr01           4821     4871 (50005267) C rnd-5_family-1488   LINE/L1                (16)   1545    1499       2  
   245   27.3  0.0  0.0  Chr01           6289     6365 (50003773) C TE_00001622         DNA/DTH               (849)   2845    2769       4 *
  1709    8.2  3.7  7.0  Chr01          30622    30678 (49979460) + TE_00000395         DNA/DTC                1655   1956     (0)      23  
   246   15.7  2.0  0.0  Chr01          44275    44325 (49965813) C TE_00000360         DNA/DTA               (100)    861     810      37 *
   601   21.0  5.0  9.7  Chr01          44865    44940 (49965198) C TE_00000371         DNA/DTA              (2229)    242       1      38  
   655   22.7  3.9  7.4  Chr01          50353    50444 (49959694) C rnd/5_family-2289   DNA/hAT-nMITE        (1389)    417     176      54  
   655   22.7  3.9  7.4  Chr01          50676    50714 (49959424) C rnd/5_family-2289   DNA/hAT-nMITE        (1389)    175      68      54  

Thank you for your attention.

rmhubley commented 7 months ago

I am not quite sure what you are asking. Smaller fragments do have a higher chance of being false positives, although they may be part of a larger alignment (see the ID field to identify joined fragments). Under normal circumstances you shouldn't need to filter annotations from the *.out file; however, you may have specific reasons for doing so. There is a script in the util directory (RM2Bed.py) that has some filtering functions, but it may be easier to write your own script.
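For illustration, here is a minimal sketch of what such a script might look like. It is not part of RepeatMasker; the function names and the 100 bp default are hypothetical. It assumes the whitespace-delimited *.out layout shown in the question (query begin and end in columns 6 and 7, ID in column 15), and it keeps a short fragment when the fragments sharing its ID on the same sequence add up to at least the threshold, so joined fragments are not discarded piecemeal.

```python
from collections import defaultdict


def parse_out(lines):
    """Yield (line, sequence, query_length, id) for each annotation line.

    Assumes the column layout shown in the question; header and blank
    lines are skipped because their first field is not an integer score.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        # Coordinates are 1-based and inclusive, hence the +1.
        yield line, fields[4], end - begin + 1, fields[14]


def filter_short_fragments(lines, min_len=100):
    """Return annotation lines whose joined length is at least min_len.

    Fragments with the same (sequence, ID) pair are treated as one
    annotation and their query lengths are summed (an assumption about
    how IDs should be grouped, keyed by sequence to be safe).
    """
    records = list(parse_out(lines))
    joined_len = defaultdict(int)
    for _, seq, length, ann_id in records:
        joined_len[(seq, ann_id)] += length
    return [
        line
        for line, seq, length, ann_id in records
        if joined_len[(seq, ann_id)] >= min_len
    ]
```

To use it, read genome.fa.out with `open(...)`, pass the file object to `filter_short_fragments`, and write the returned lines to a new file. Note that this sketch drops the header lines, so they would need to be copied over separately if a downstream tool expects them. On the seven rows posted above, a 100 bp threshold keeps only the two fragments with ID 54 (92 bp + 39 bp = 131 bp); every other row is a single fragment shorter than 100 bp.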