Interpreting discrepancies between alignment (.align file) and annotation (.out) and understanding how IDs are assigned.

agustin-bilat commented 1 year ago

I noticed that the same ID (column 15) is shared between elements that have different names (column 10). This seems to be caused by combining information from two consecutive alignments in the .align file into a single line of the .out file.

Could you please explain why this is happening in my particular example (i.e. why different elements share the same ID)? What I understood from the documentation is that this inconsistency arises when RepeatMasker tries to decide whether fragments derive from the same integrated transposable element, by parsing (somehow) the information in the alignment file so that the annotation (.out file) better reflects biological reality. However, I am not sure that what I am seeing matches this description, because the two sequences are quite different. So I think I do not fully understand how the algorithm processes the alignment file to produce the annotation.
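For anyone who wants to check their own results for the same pattern, here is a minimal sketch that lists IDs attached to more than one repeat name. It assumes the standard whitespace-separated .out layout, with the repeat name in column 10 and the ID in column 15. The function name `shared_ids` and the sample rows (`famA`, `famB`, `famC`) are made up for illustration and are not taken from my data.

```python
from collections import defaultdict


def shared_ids(out_lines):
    """Return {ID: set of repeat names} for IDs used by more than one name.

    Assumes the usual RepeatMasker .out columns: repeat name in column 10
    and ID in column 15 (1-based), separated by whitespace.
    """
    names_by_id = defaultdict(set)
    for line in out_lines:
        fields = line.split()
        # Skip blank lines and header lines (their first field is not a score).
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        names_by_id[fields[14]].add(fields[9])
    return {rid: names for rid, names in names_by_id.items() if len(names) > 1}


# Hypothetical rows in the 15-column layout, for illustration only.
sample = [
    "   SW   perc perc perc  query  position in query  matching  repeat  position in repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat  class/family  begin end (left)  ID",
    "",
    "  476 24.2 9.1 5.5 chr1  7212  7421 (289316966) C famA Unknown (0) 560 344 15",
    "  966 21.1 0.6 0.6 chr1  9946 10126 (289314261) C famB Unknown (921) 181 1 15",
    "  300 20.0 1.0 1.0 chr1 12000 12100 (289312287) + famC Unknown 1 100 (50) 16",
]

print(shared_ids(sample))
```

On a real run, the same function can be fed an open file handle for the .out file, since it only iterates over lines.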

Command: RepeatMasker -pa 16 -e rmblast -lib Fgig_extended_lib.fa -s -a -gff -dir outRMask_slow Fgig.chrom.fna

Version: RepeatMasker version 4.1.2-p1

Fgig_extended_lib.fa is a local library obtained from RepeatModeler.

Below I attach the part of the output (.align and .out) where you can see the inconsistency I mentioned in the elements with ID = 15, along with the local alignment between the two consensus sequences (input queries) involved in it.

Many thanks, Agustin

rpmask.subset.align.txt

rpmask.subset.out.txt

alignment.water.txt

rmhubley commented 1 year ago

Could you include the *.cat file generated for this example?

agustin-bilat commented 1 year ago

Here is a link to the requested file:

https://drive.google.com/file/d/1fNCByQv4nWEiOb9IKjiwi8uWAabOoKo9/view?usp=share_link

Alternatively, parts of that very file are shown below:

Fgig.chrom.fna.subs.cat.txt

Fgig.chrom.fna.tail_batch.cat.txt

rmhubley commented 1 year ago

Ah..yes. This is a consequence of 1) RepeatMasker's rules for joining fragments and 2) the poor quality of un-curated libraries. Here are the pieces that were considered in this join:

 476 24.2 9.1 5.5 chr1                   7212     7421 289316966 C putatFam-307#Unkno      344      560        0                17
..
 966 21.1 0.6 0.6 chr1                   9946    10126 289314261 C putatFam-257#Unkno        1      181      921                25 ->
 254 29.5 2.0 5.2 chr1                  10040    10139 289314248 C putatFam-307#Unkno       37      133      427                25 <-

ProcessRepeats uses the classification as a key factor in determining compatibility among fragments. Unfortunately, "Unknown" is assigned to many family entries in this library, and ProcessRepeats (currently) treats it like any other class designation. So in the example above, ProcessRepeats identifies significant overlap between families 257 and 307 (86 bp in the last two annotations) and merges them into one annotation from 9946 to 10139. The first annotation (putatFam-307) is compatible with the second 307 annotation (same orientation, co-linear, etc.), which causes ProcessRepeats to join these three annotations as part of the same insertion. I am simplifying the rules a bit here: there are different rules for different classes of TEs, and joining also considers insertion order relationships when finding compatible partners.
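To make the arithmetic concrete, here is a rough Python sketch of the overlap and merge for the three rows above. It is not the ProcessRepeats code (which is Perl and applies many more rules). The `Hit` type, `overlap_bp`, `merged_span`, `joinable`, and the `unjoinable` parameter are hypothetical names for illustration, and the class is taken to be "Unknown" as described above, since the rows show it truncated.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    start: int
    end: int
    strand: str
    family: str
    rm_class: str


def overlap_bp(a, b):
    """Overlap as a coordinate difference (min end - max start), floored at 0.

    Counting both end positions inclusively would give one more base.
    """
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def merged_span(a, b):
    """Smallest interval covering both hits."""
    return min(a.start, b.start), max(a.end, b.end)


def joinable(a, b, unjoinable=frozenset()):
    """Toy compatibility test: same class, same strand, overlapping.

    Classes listed in `unjoinable` are never joined (an assumed policy,
    not current ProcessRepeats behaviour).
    """
    if a.rm_class in unjoinable or b.rm_class in unjoinable:
        return False
    return a.rm_class == b.rm_class and a.strand == b.strand and overlap_bp(a, b) > 0


# Query coordinates, strand, and family names copied from the rows above.
hits = [
    Hit(7212, 7421, "C", "putatFam-307", "Unknown"),
    Hit(9946, 10126, "C", "putatFam-257", "Unknown"),
    Hit(10040, 10139, "C", "putatFam-307", "Unknown"),
]

print(overlap_bp(hits[1], hits[2]))    # overlap between the 257 and 307 hits
print(merged_span(hits[1], hits[2]))   # span of the merged annotation
print(joinable(hits[1], hits[2]))      # joined when "Unknown" is an ordinary class
print(joinable(hits[1], hits[2], unjoinable={"Unknown"}))
```

With "Unknown" treated as an ordinary class, the toy check accepts the 257/307 pair; passing `unjoinable={"Unknown"}` rejects it, which would leave the annotations separate.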

In this case, 257 and 307 may be fragments of a larger family that should be merged in the library, or there may be some mosaicism in one family or the other. We should probably declare the class "Unknown" an un-joinable class and leave these results alone in ProcessRepeats, since not enough information is present to make a detailed call.

Let me know if you have any further questions.

agustin-bilat commented 1 year ago

That answers my questions.

Thank you very much.

rmhubley / RepeatMasker

Interpreting discrepancies between alignment (.align file) and annotation (.out) and understanding how IDs are assigned. #212