nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Transposable Elements #272

PlantDr430 closed 5 years ago

PlantDr430 commented 5 years ago

I am using v1.5.1, since v1.5.2 only just came out and, looking at the script, I don't see a change to this feature.

I was wondering how you calculated the transposable elements that are recorded during the filtering process. For example, in the logfile we see:

[01/06/19 11:26:39]: Found 1,069 gene models to remove: 554 too short; 0 span gaps; 540 transposable elements

I was able to obtain the exact number by adding the "repeat_match" counts in predict_misc/bad_models.gff to the "gene" counts in predict_misc/genome.repeats.to.remove.gff.

However, I noticed that there were sometimes duplicates, where a gene model was found in both bad_models.gff and genome.repeats.to.remove.gff. So technically, a gene model that appears in both files shouldn't be counted as two transposable elements.

Again, I don't know if this is how you calculated the TEs, but it was the only way I was able to match the numbers recorded in the log files.
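To make the double counting concrete, here is a minimal Python sketch of the check described above. It is not funannotate's code. It assumes that each "repeat_match" row in bad_models.gff and each "gene" row in genome.repeats.to.remove.gff carries the gene model name in an `ID=` attribute in column 9, which may not hold for the real files. The function names `ids_by_type` and `count_te_models` are hypothetical.

```python
def ids_by_type(gff_lines, feature_type):
    """Collect the ID= attribute of every GFF row with the given feature type.

    Assumption: the gene model name is stored as ID=... in column 9.
    """
    ids = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        # Column 9 is a ';'-separated list of key=value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            ids.add(attrs["ID"])
    return ids


def count_te_models(blast_hit_ids, overlap_ids):
    """Return (naive_sum, unique_models, counted_twice) for the two ID sets."""
    blast_hit_ids, overlap_ids = set(blast_hit_ids), set(overlap_ids)
    naive = len(blast_hit_ids) + len(overlap_ids)   # sum of the two counts
    unique = len(blast_hit_ids | overlap_ids)       # each model counted once
    return naive, unique, naive - unique
```

Passing the lines of each file through `ids_by_type` and then calling `count_te_models` on the two sets would show how far the simple sum is from the number of distinct gene models.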

nextgenusfs commented 5 years ago

Yeah, it's probably counting a few twice if they are found by both the overlap check and the BLAST search. I'll make a note to separate these counters. The function is here: https://github.com/nextgenusfs/funannotate/blob/3129958475aa48a5a0d23b4af1cdcd85facbb388/lib/library.py#L4549-4616

PlantDr430 commented 5 years ago

Ah okay. It also seems that this only removes transposable elements that overlap gene models and does not count ones that lie between genes. I'll have to do a separate TE analysis then. Thanks for replying.

nextgenusfs commented 5 years ago

Yes, it's not meant as a TE classifier or anything, just a method to filter out putative TEs. This was necessary when annotating highly repetitive fungal genomes: methods like MAKER's RepeatRunner are too stringent and hide the sequence information from the predictors, resulting in many fewer predictions than actually exist. So funannotate soft-masks the genome and then goes back and tries to remove only those predictions that have high homology to known TEs or are contained nearly completely in a soft-masked region.
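As a rough illustration of the "contained nearly completely in a soft-masked region" idea, the sketch below flags gene models whose span is mostly lowercase (soft-masked) sequence. It is a simplified sketch, not the function linked above: the names `masked_fraction` and `split_by_masking`, the 0-based half-open coordinates, and the 0.9 cutoff are all assumptions made for the example.

```python
def masked_fraction(seq, start, end):
    """Fraction of seq[start:end] that is lowercase, i.e. soft-masked."""
    region = seq[start:end]
    if not region:
        return 0.0
    return sum(1 for base in region if base.islower()) / len(region)


def split_by_masking(seq, models, min_masked=0.9):
    """Split (name, start, end) models into kept and removed names.

    A model is removed when at least `min_masked` of its span is soft-masked.
    The 0.9 default is an illustrative cutoff, not funannotate's setting.
    """
    keep, drop = [], []
    for name, start, end in models:
        if masked_fraction(seq, start, end) >= min_masked:
            drop.append(name)
        else:
            keep.append(name)
    return keep, drop
```

Because the genome is only soft-masked, the gene predictors still see the full sequence, and a check like this is applied to their output afterwards rather than hiding the repeats up front.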