oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

Are there any hints to work with non-plant species? #91

Closed DrHogart closed 4 years ago

DrHogart commented 4 years ago

Hi, I'm trying to explore the TE content of a mosquito genome. As far as I understand, the EDTA pipeline was developed to work with plant species and was inspired by this guide. I've found an undocumented option in EDTA that can help filter out protein-coding genes from the predicted TE families: $protlib (in EDTA_raw.pl), which references a cleaned plant proteome. Obviously, I can change the link to point to a cleaned (without any traces of TE-derived proteins) mosquito-specific proteome. The question is: are there any other tweaks in EDTA that can help with genomes other than plants? I mean the discovery, filtering, and cleaning options.

oushujun commented 4 years ago

Hi, yes. You don't need to tweak the code; just provide a mosquito CDS file to the program (--cds) to filter protein-coding sequences out of the TE annotation. Also, you may want to use --sensitive 1 to identify non-LTR retrotransposons (via RepeatModeler). Or, if you have a manually curated set of TEs, please give it to the program via --curatedlib. The set does not have to be complete or comprehensive, but please make sure the provided elements are authentic. There are many non-plant applications of this program; you can find them in #15.

Best, Shujun
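
For readers following along, here is a minimal sketch of how the options mentioned above could be combined into one EDTA call. The helper `build_edta_command` and the file names are hypothetical and only for illustration; `--cds`, `--sensitive`, and `--curatedlib` come from the reply above, while `--genome`, `--species others`, and `--threads` are assumed from the EDTA README and should be checked against your installed version. The snippet only assembles and prints the command; it does not run EDTA.

```python
import shlex


def build_edta_command(genome, cds=None, curatedlib=None, sensitive=True, threads=4):
    """Assemble an EDTA.pl argument list (hypothetical helper; paths are placeholders)."""
    # --species others is assumed to be the right choice for a non-plant genome.
    cmd = ["EDTA.pl", "--genome", genome, "--species", "others"]
    if cds:
        # CDS from the same or a sister species, used to filter protein-coding sequences.
        cmd += ["--cds", cds]
    if curatedlib:
        # Optional manually curated TE library; it does not have to be complete.
        cmd += ["--curatedlib", curatedlib]
    # --sensitive 1 adds an extra RepeatModeler search round.
    cmd += ["--sensitive", "1" if sensitive else "0", "--threads", str(threads)]
    return cmd


cmd = build_edta_command("mosquito.fa", cds="sister_species.cds.fa")
print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the pipeline, assuming `EDTA.pl` is on the `PATH`.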

DrHogart commented 4 years ago

My genome is novel, just assembled, and a gene annotation is not available yet. So I would prefer to use $protlib with proteins from a related species. My question arose after reading the RM2 paper (https://www.pnas.org/content/117/17/9451), in which they show that EDTA outperforms RM2 in terms of sensitivity only for plants, but not for Drosophila. So I'm wondering which settings can be tuned in EDTA to increase its sensitivity. Also, there are a lot of if ($beta2==1) blocks in the code that add some additional cleaning of the predicted sequences. Did you test this functionality with the reference species?

oushujun commented 4 years ago

You may use the sister species' CDS to do the job. I don't recommend changing $protlib because it only does low-level cleaning. The RM2 paper shows that EDTA identified fewer sequences in Drosophila while RM2 identified more, which doesn't necessarily mean RM2 was more sensitive. Imagine a program that identifies 100% of the sequence as TEs; such a program has certainly done something wrong. To benefit from the extra sensitivity RM2 may contribute, you can use the --sensitive 1 parameter, which recruits RM2 to do an extra round of searching. beta2 is under development, unmaintained, and not tested. Please don't use it for now.

Shujun


DrHogart commented 4 years ago

Thanks. Last question: why doesn't EDTA cluster the final TElib? CD-HIT and usearch show that there are some redundant sequences. E.g.

>Cluster 214
0       197nt, >TE_00000985#DNA/DTA... at 32:197:753:918/+/97.59%                                        
1       2782nt, >TE_00001061#DNA/DTA... *  

oushujun commented 4 years ago

The final TElib may have some level of redundancy, but the highly redundant part should have been removed. Some sequences may share quite a bit of similarity with others but not meet the clustering threshold, so they are kept as separate sequences. You may use other clustering methods to perform additional clustering.
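
If you do run an extra clustering pass with CD-HIT, a small script can list which library entries ended up as non-representative cluster members. The sketch below assumes the `.clstr` text layout shown in the example above (a `>Cluster N` header followed by member lines, with `*` marking the representative); `parse_clstr` and `redundant_members` are hypothetical helper names, not part of EDTA or CD-HIT. Note that CD-HIT may truncate sequence names in this file, so the reported names are prefixes.

```python
import re

# One member line: index, length in nt, ">name...", then "*" or "at <alignment>".
MEMBER_RE = re.compile(r"^\d+\s+(\d+)nt, >(.+?)\.\.\. (\*|at .*)$")


def parse_clstr(text):
    """Parse CD-HIT .clstr text into {cluster_name: [member dicts]}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match and current is not None:
            length, name, tail = match.groups()
            clusters[current].append(
                {"name": name, "length": int(length), "representative": tail == "*"}
            )
    return clusters


def redundant_members(clusters):
    """Return names of all members that are not their cluster's representative."""
    return [
        member["name"]
        for members in clusters.values()
        for member in members
        if not member["representative"]
    ]


sample = """>Cluster 214
0       197nt, >TE_00000985#DNA/DTA... at 32:197:753:918/+/97.59%
1       2782nt, >TE_00001061#DNA/DTA... *
"""
clusters = parse_clstr(sample)
print(redundant_members(clusters))
```

In the example cluster, the flagged entry is a 197 nt sequence matched against a 2782 nt representative, so it is worth inspecting such short members by hand before dropping them from the library.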