Closed PlantDr430 closed 5 years ago
Yeah, it's probably counting a few twice if they are found in both the overlap and the BLAST results. I'll make a note to separate these counters. The function is here: https://github.com/nextgenusfs/funannotate/blob/3129958475aa48a5a0d23b4af1cdcd85facbb388/lib/library.py#L4549-4616
Ah, okay. It also seems that this only removes transposable elements that overlap genes and does not count ones that lie between genes. I'll have to do a separate TE analysis then. Thanks for replying.
Yes, it's not meant as a TE classifier or anything, just a method to filter out putative TEs. This was necessary when annotating highly repetitive fungal genomes: methods like MAKER's RepeatRunner are too stringent and hide the sequence information from the predictors, resulting in many fewer predictions than actually exist. So funannotate soft-masks the genome and then goes back and tries to remove only those predictions that have high homology to known TEs or are contained nearly completely in a soft-masked region.
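To make the two criteria above concrete, here is a minimal sketch of that kind of filter. It is not the funannotate code: the function names, the `has_te_hit` flag, and the 0.9 cutoff are all hypothetical and chosen only for illustration.

```python
def masked_fraction(gene_start, gene_end, repeats):
    """Fraction of a gene (1-based, inclusive coordinates) covered by repeat intervals.

    `repeats` is a list of (start, end) tuples on the same contig.
    Overlapping repeat intervals are only counted once.
    """
    covered = 0
    last_end = gene_start - 1  # rightmost gene position already counted
    for start, end in sorted(repeats):
        # Clip the repeat to the gene and skip bases already counted.
        start = max(start, gene_start, last_end + 1)
        end = min(end, gene_end)
        if start <= end:
            covered += end - start + 1
            last_end = end
    return covered / (gene_end - gene_start + 1)


def is_putative_te(gene_start, gene_end, repeats, has_te_hit, min_fraction=0.9):
    """Flag a gene model if it has a TE homology hit OR is mostly soft-masked.

    `has_te_hit` stands in for a homology search result and `min_fraction`
    is an assumed cutoff; neither is taken from funannotate itself.
    """
    if has_te_hit:
        return True
    return masked_fraction(gene_start, gene_end, repeats) >= min_fraction
```

For example, a gene at 101-200 with repeats at 90-150 and 140-180 is 80% masked, so under the assumed 0.9 cutoff it would only be flagged if it also had a TE homology hit.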
I am using v1.5.1, as 1.5.2 just came out, and looking at the script I don't see a change in this feature.
I was wondering how you calculated the transposable elements that are recorded during the filtering process. For example, in the logfile we see:
I was able to obtain the exact number by adding up the "repeat_match" counts in predict_misc/bad_models.gff with the "gene" counts in predict_misc/genome.repeats.to.remove.gff.
However, I noticed that sometimes there were duplicates, where a gene model was found in both bad_models.gff and genome.repeats.to.remove.gff. So technically, if there are duplicates, they shouldn't be counted as two transposable elements.
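As a rough illustration of the difference between the summed count and a de-duplicated one, here is a small sketch. It assumes both files are tab-delimited GFF and that the same gene model carries the same `ID=` attribute in column 9 of each file; I have not verified that against the actual funannotate output, so adjust the attribute key as needed. The function names are made up for this example.

```python
def feature_ids(gff_lines, feature_type):
    """Collect ID attribute values from GFF rows whose type (column 3) matches."""
    ids = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        # Assumption: attributes look like "ID=xyz;Other=..." (GFF3 style).
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID" and value:
                ids.add(value)
    return ids


def count_filtered(bad_models_lines, repeats_to_remove_lines):
    """Compare the naive sum of both files with the number of unique models."""
    blast = feature_ids(bad_models_lines, "repeat_match")
    overlap = feature_ids(repeats_to_remove_lines, "gene")
    return {
        "summed": len(blast) + len(overlap),  # how I matched the log number
        "unique": len(blast | overlap),       # each gene model counted once
        "in_both": len(blast & overlap),      # models that were double counted
    }
```

It could be run with something like `count_filtered(open("predict_misc/bad_models.gff"), open("predict_misc/genome.repeats.to.remove.gff"))`; the difference between `summed` and `unique` is the number of double-counted models.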
Again, I don't know if this is how you calculated the TEs, but it was the only way I was able to match the numbers recorded in the log files.