Closed: agustin-bilat closed this issue 1 year ago
Could you include the *.cat file generated for this example?
Here is a link to the requested file:
https://drive.google.com/file/d/1fNCByQv4nWEiOb9IKjiwi8uWAabOoKo9/view?usp=share_link
Alternatively, parts of that very file are shown below:
Ah, yes. This is a consequence of 1) RepeatMasker's rules for joining fragments and 2) the poor quality of uncurated libraries. Here are the pieces that were considered in this join:
476 24.2 9.1 5.5 chr1 7212 7421 289316966 C putatFam-307#Unkno 344 560 0 17
..
966 21.1 0.6 0.6 chr1 9946 10126 289314261 C putatFam-257#Unkno 1 181 921 25 ->
254 29.5 2.0 5.2 chr1 10040 10139 289314248 C putatFam-307#Unkno 37 133 427 25 <-
ProcessRepeats uses the classification as a key factor in determining compatibility among fragments. Unfortunately, "Unknown" is assigned to many family entries in this library, and ProcessRepeats currently treats it like any other class designation. So in the example above, ProcessRepeats identifies significant overlap between families 257 and 307 (86 bp in the last two annotations) and merges them into one annotation from 9946 to 10139. The first annotation (putatFam-307) is compatible with the second 307 annotation (same orientation, co-linear, etc.), which causes ProcessRepeats to join these three annotations as part of the same insertion. I am simplifying the rules a bit here: there are different rules for different classes of TEs, and joining does consider insertion order relationships when finding compatible partners.
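To make the arithmetic concrete, here is a minimal Python sketch that parses the three lines quoted above and checks the conditions described in this reply. It is only an illustration of the simplified description, not the actual ProcessRepeats logic; the `Fragment` class and the helper names are made up for this example, and the field positions are assumed from the lines as pasted. Counted with 1-based inclusive coordinates the overlap comes out as 87 positions; the 86 bp quoted above matches the plain difference 10126 - 10040.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    # Hypothetical container for the fields used in this illustration.
    seq: str
    start: int
    end: int
    strand: str
    family: str
    rep_class: str


def parse_fragment(line: str) -> Fragment:
    """Parse one whitespace-separated line as pasted in this thread."""
    f = line.split()
    family, _, rep_class = f[9].partition("#")  # e.g. putatFam-307#Unkno
    return Fragment(f[4], int(f[5]), int(f[6]), f[8], family, rep_class)


def overlap_bp(a: Fragment, b: Fragment) -> int:
    """Shared positions, assuming 1-based inclusive coordinates."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


LINES = [
    "476 24.2 9.1 5.5 chr1 7212 7421 289316966 C putatFam-307#Unkno 344 560 0 17",
    "966 21.1 0.6 0.6 chr1 9946 10126 289314261 C putatFam-257#Unkno 1 181 921 25",
    "254 29.5 2.0 5.2 chr1 10040 10139 289314248 C putatFam-307#Unkno 37 133 427 25",
]
first_307, fam_257, second_307 = (parse_fragment(l) for l in LINES)

# Overlap between the 257 and 307 annotations, and the span of their merge.
shared = overlap_bp(fam_257, second_307)
merged = (min(fam_257.start, second_307.start),
          max(fam_257.end, second_307.end))

# Both carry the same (truncated) "Unknown" class label, so a class-based
# compatibility test cannot tell them apart.
same_class = fam_257.rep_class == second_307.rep_class

# The upstream 307 fragment matches the downstream one in family and strand.
same_family_same_strand = (first_307.family == second_307.family
                           and first_307.strand == second_307.strand)

print(shared, merged, same_class, same_family_same_strand)
```

Declaring "Unknown" un-joinable would amount to making the `same_class` test fail whenever the label is "Unknown".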
In this case, 257 and 307 may be fragments of a larger family that should be merged in the library, or there is some mosaicism present in one or the other family. We should probably declare the class "Unknown" as an un-joinable class and leave these results alone in ProcessRepeats, since not enough information is present to make a detailed call.
Let me know if you have any further questions.
That answers my questions.
Thank you very much.
I noticed that the same ID (column 15) is shared between elements that have different names (column 10). This seems to result from combining information from two consecutive alignments in the .align file into a single line in the .out file.
Could you please explain why this is happening in my particular example (i.e. why different elements share the same ID)? What I understood from the documentation is that this inconsistency arises from trying to decide whether fragments are derived from the same integrated transposable element by parsing (somehow) the information in the alignment file, in order to better reflect biological reality in the annotation (.out file). However, I am not sure that what I am seeing corresponds to this description, because the two sequences are quite different. So I think I do not fully understand how the algorithm processes the alignment file to produce the annotation.
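For anyone who wants to look for the same pattern in their own output, a short script along these lines lists every ID shared by more than one repeat name. It is only a sketch: it assumes the standard 15-column .out layout (repeat name in column 10, ID in column 15, header lines not starting with a numeric score), the function name `names_by_id` is made up, and the sample rows are invented for illustration and are not taken from the attached files.

```python
from collections import defaultdict


def names_by_id(out_lines):
    """Map each ID to its repeat names, keeping IDs with more than one name."""
    groups = defaultdict(set)
    for line in out_lines:
        fields = line.split()
        # Skip header and blank lines: data rows start with an integer score.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        repeat_name, repeat_id = fields[9], fields[14]
        groups[repeat_id].add(repeat_name)
    return {rid: sorted(names) for rid, names in groups.items()
            if len(names) > 1}


# Invented sample rows in the assumed 15-column layout (not real data).
SAMPLE = [
    "   SW   perc perc perc  query  position in query   matching repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat class/family",
    "",
    "  500  20.0  1.0  1.0  seqA  100  300  (700)  C  famX  Unknown  (10)  200  1  15",
    "  400  22.0  2.0  2.0  seqA  250  400  (600)  C  famY  Unknown  (50)  150  1  15",
    "  300  25.0  0.0  0.0  seqA  800  900  (100)  +  famX  Unknown  1  100  (110)  16",
]

shared_ids = names_by_id(SAMPLE)
print(shared_ids)
```

With a real file, the lines could be read with `open(path)` and passed to `names_by_id` directly.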
Command: RepeatMasker -pa 16 -e rmblast -lib Fgig_extended_lib.fa -s -a -gff -dir outRMask_slow Fgig.chrom.fna
Version: RepeatMasker version 4.1.2-p1
Fgig_extended_lib.fa is a local library obtained from RepeatModeler.
Attached below is the part of the output (.align and .out) showing the inconsistency I mentioned, in the elements with ID = 15, together with the local alignment between the two consensus sequences (input queries) involved.
Many thanks, Agustin
rpmask.subset.align.txt
rpmask.subset.out.txt
alignment.water.txt