kavonrtep / dante_ltr

GNU General Public License v3.0

DANTE_LTR

DANTE_LTR is a tool for identifying complete LTR retrotransposons based on analysis of protein domains identified with the DANTE tool. Both DANTE and DANTE_LTR are available on the Galaxy server.

Preprint

DANTE and DANTE_LTR: computational pipelines implementing lineage-centered annotation of LTR-retrotransposons in plant genomes, Petr Novak, Nina Hostakova, Pavel Neumann, Jiri Macas bioRxiv 2024.04.17.589915; doi: https://doi.org/10.1101/2024.04.17.589915

Principle of DANTE_LTR

Complete retrotransposons are identified as clusters of protein domains recognized by the DANTE tool. All domains in a cluster must be assigned to a single retrotransposon lineage by DANTE. In addition, the orientation and order of the protein domains, as well as the distances between them, must conform to the characteristics of elements from the REXdb database (Neumann et al., 2019). In the next step, the 5' and 3' regions of the putative retrotransposon are examined for the presence of 5' and 3' long terminal repeats (LTRs). If both LTRs are detected, the element is searched for a target site duplication (TSD) and a primer binding site (PBS). The detected LTR retrotransposons are classified into 5 categories:

dante_ltr_workflow.png
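
The domain-cluster criteria described above (single lineage, consistent orientation, expected domain order, bounded span) can be illustrated with a minimal Python sketch. This is not the DANTE_LTR implementation: the `Domain` class, the function name, and the reading of `max_span` as an upper limit on the distance covered by the domains are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    """Hypothetical, simplified record for one DANTE domain hit."""
    name: str     # domain name, e.g. "RT"
    lineage: str  # lineage assigned by DANTE
    start: int    # 1-based start coordinate
    end: int      # 1-based end coordinate
    strand: str   # "+" or "-"


def is_candidate_cluster(domains, expected_order, max_span):
    """Return True if the domains satisfy the simplified cluster criteria."""
    if not domains:
        return False
    # All domains must belong to one lineage and lie on one strand.
    if len({d.lineage for d in domains}) != 1:
        return False
    if len({d.strand for d in domains}) != 1:
        return False
    # The region covered by the domains must not exceed the allowed span.
    span = max(d.end for d in domains) - min(d.start for d in domains) + 1
    if span > max_span:
        return False
    # Read the domains in element orientation: descending coordinates
    # on the minus strand, ascending on the plus strand.
    ordered = sorted(domains, key=lambda d: d.start,
                     reverse=(domains[0].strand == "-"))
    return [d.name for d in ordered] == list(expected_order)


ORDER = ["GAG", "PROT", "INT", "RT", "RH"]
LINEAGE = "Class_I/LTR/Ty1_copia/Ale"
coords = [(100, 900), (1000, 1400), (1500, 2600), (2700, 3500), (3600, 4000)]

good = [Domain(n, LINEAGE, s, e, "+") for n, (s, e) in zip(ORDER, coords)]
shuffled = [Domain(n, LINEAGE, s, e, "+")
            for n, (s, e) in zip(["PROT", "GAG", "INT", "RT", "RH"], coords)]
mixed = good[:-1] + [Domain("RH", "Class_I/LTR/Ty1_copia/Angela", 3600, 4000, "+")]

print(is_candidate_cluster(good, ORDER, 5700))
print(is_candidate_cluster(shuffled, ORDER, 5700))
print(is_candidate_cluster(mixed, ORDER, 5700))
```

The sketch requires all expected domains to be present; the real tool additionally tolerates missing domains (option -M) and goes on to search for LTRs, TSD and PBS, none of which is modelled here.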

Availability

DANTE_LTR and DANTE are available on the Galaxy server or can be installed using the conda package manager.

Installation:

conda create -n dante_ltr -c bioconda -c conda-forge -c petrnovak dante_ltr

Quick start guide - How to use DANTE and DANTE_LTR on Galaxy server

Detailed tutorial on how to use DANTE and DANTE_LTR on Galaxy server is here.

Quick start guide - How to use command line version of DANTE and DANTE_LTR

Install both DANTE and DANTE_LTR into a single conda environment:

conda create -n dante_ltr -c bioconda -c conda-forge -c petrnovak dante_ltr dante
conda activate dante_ltr

Download example data:

wget https://raw.githubusercontent.com/kavonrtep/dante_ltr/main/test_data/sample_genome.fasta

Run DANTE on the sample genome using 10 CPUs:

dante -q sample_genome.fasta -o DANTE_output.gff3 -c 10

The output is a GFF3 file named DANTE_output.gff3 containing the annotation of individual protein domains identified by DANTE. See the DANTE documentation for more details (https://github.com/kavonrtep/dante).

Identify complete LTR retrotransposons from the DANTE output using DANTE_LTR:

dante_ltr -g DANTE_output.gff3 -s sample_genome.fasta -o DANTE_LTR_annotation -M 1

Option -M 1 allows one missing domain in a complete LTR retrotransposon.
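
To make the effect of -M concrete, the following Python sketch counts how many expected domains are absent from a detected cluster and compares that count with a threshold. How dante_ltr counts missing domains internally is not described here, so the function names and the rule that the detected domains must appear in the expected order are assumptions.

```python
def count_missing_domains(found, expected):
    """Return the number of expected domains absent from `found`.

    Returns None when `found` is not an ordered subset (subsequence)
    of `expected`, i.e. when the domain order conflicts.
    """
    remaining = iter(expected)
    # `name in remaining` advances the iterator, so each detected domain
    # must occur after the previously matched one.
    if not all(name in remaining for name in found):
        return None
    return len(expected) - len(found)


def passes_max_missing(found, expected, max_missing):
    """True if the cluster is ordered correctly and lacks at most `max_missing` domains."""
    missing = count_missing_domains(found, expected)
    return missing is not None and missing <= max_missing


EXPECTED = ["GAG", "PROT", "INT", "RT", "RH"]
no_int = ["GAG", "PROT", "RT", "RH"]  # INT was not detected

print(count_missing_domains(no_int, EXPECTED))
print(passes_max_missing(no_int, EXPECTED, max_missing=1))
print(passes_max_missing(no_int, EXPECTED, max_missing=0))
```

Under these assumptions, an element lacking only INT is accepted with a threshold of 1 and rejected with a threshold of 0.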

Output files will include:

Create library of LTR RTs for similarity based annotation

dante_ltr_to_library -g DANTE_LTR_annotation.gff3 -s sample_genome.fasta -o LTR_RT_library.fasta

This step creates a non-redundant library of LTR-RT sequences suitable for similarity-based annotation using RepeatMasker.
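
The three quick-start steps can be chained from a script. The sketch below only assembles the command lines shown above (same flags and file names) and prints them; the helper names `pipeline_commands` and `run_pipeline` are hypothetical, and actually running the pipeline assumes the conda environment with both tools is active.

```python
import shlex
import subprocess


def pipeline_commands(genome, prefix="DANTE_LTR_annotation", cpus=10, max_missing=1):
    """Return the quick-start commands as argument lists, in execution order."""
    dante_gff = "DANTE_output.gff3"
    return [
        # 1. annotate protein domains with DANTE
        ["dante", "-q", genome, "-o", dante_gff, "-c", str(cpus)],
        # 2. identify complete LTR retrotransposons
        ["dante_ltr", "-g", dante_gff, "-s", genome, "-o", prefix,
         "-M", str(max_missing)],
        # 3. build the library for similarity-based annotation
        ["dante_ltr_to_library", "-g", f"{prefix}.gff3", "-s", genome,
         "-o", "LTR_RT_library.fasta"],
    ]


def run_pipeline(genome, **kwargs):
    """Run the steps one after another, stopping at the first failure."""
    for cmd in pipeline_commands(genome, **kwargs):
        print("running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)


commands = pipeline_commands("sample_genome.fasta")
for cmd in commands:
    print(shlex.join(cmd))
```

Calling `run_pipeline("sample_genome.fasta")` would execute the steps; `check=True` makes a failing step raise an exception, so later steps never run on missing input.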

Tools description

Detection of complete LTR retrotransposons

usage: dante_ltr [-h] -g GFF3 -s REFERENCE_SEQUENCE -o OUTPUT [-c CPU]
                 [-M MAX_MISSING_DOMAINS] [-L MIN_RELATIVE_LENGTH] [-S MAX_CHUNK_SIZE]
                 [-v] [--te_constrains TE_CONSTRAINS] [--no_ambiguous_domains]

        Tool for identifying complete LTR retrotransposons based on 
        analysis of protein domains identified with the DANTE tool

options:
  -h, --help            show this help message and exit
  -g GFF3, --gff3 GFF3  gff3 file with full output from Domain Based Annotation of Transposable Elements (DANTE)
  -s REFERENCE_SEQUENCE, --reference_sequence REFERENCE_SEQUENCE
                        reference sequence as fasta file
  -o OUTPUT, --output OUTPUT
                        output file path and prefix
  -c CPU, --cpu CPU     number of CPUs
  -M MAX_MISSING_DOMAINS, --max_missing_domains MAX_MISSING_DOMAINS
  -L MIN_RELATIVE_LENGTH, --min_relative_length MIN_RELATIVE_LENGTH
                        Minimum relative length of protein domain to be considered for retrostransposon detection
  -S MAX_CHUNK_SIZE, --max_chunk_size MAX_CHUNK_SIZE

                                If size of reference sequence is greater than this value, reference is '
                                'analyzed in chunks of this size. default is 100000000 '
                                'Setting this value too small  will slow down the analysis

  -v, --version         show program's version number and exit
  --te_constrains TE_CONSTRAINS
                        csv table specifying TE constraints for LTR search, template for this table 
                        can be found in https://github.com/kavonrtep/dante_ltr/blob/main/databases/lineage_domain_order.csv
  --no_ambiguous_domains
                        Remove ambiguous domains from analysis

Example:

mkdir -p tmp
./dante_ltr -g test_data/sample_DANTE.gff3 -s test_data/sample_genome.fasta -o tmp/ltr_annotation

Files in the output of extract_putative_ltr.R:

Making library of LTR RTs for RepeatMasker

To annotate LTR-RT elements with a custom library using a similarity-based approach, use the dante_ltr_to_library script, which creates a non-redundant library formatted for RepeatMasker:

usage: dante_ltr_to_library [-h] -g GFF3 -s REFERENCE_SEQUENCE -o OUTPUT_DIR [-m MIN_COVERAGE] [-c CPU]

Creation of repeat library from dante_ltr output. Extract sequences based on gff3 inpute and reference fasta file. Run mmseqs2 clustering to cluster similar sequences to reduce library size. Exclude
clusters which have conflicting annotations and coverage below specified threshold.

options:
  -h, --help            show this help message and exit
  -g GFF3, --gff3 GFF3  gff3 file
  -s REFERENCE_SEQUENCE, --reference_sequence REFERENCE_SEQUENCE
                        fasta file
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        output directory
  -m MIN_COVERAGE, --min_coverage MIN_COVERAGE
                        Minimum coverage of cluster to be included in repeat library (default: 3)
  -c CPU, --cpu CPU     Number of cpus to use
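
The cluster filtering described in the help text (drop clusters with conflicting annotations, drop clusters below the coverage threshold) might look like the following Python sketch. It is an illustration only: the data layout is hypothetical, and treating "coverage" as the number of sequences in a cluster is an assumption, since the help text does not define it.

```python
def filter_clusters(clusters, min_coverage=3):
    """Keep clusters with one consistent annotation and enough members.

    `clusters` maps a cluster id to a list of (sequence_id, lineage) pairs.
    ASSUMPTION: coverage is taken to be the number of member sequences.
    """
    kept = {}
    for cluster_id, members in clusters.items():
        lineages = {lineage for _, lineage in members}
        if len(lineages) > 1:
            continue  # conflicting annotations within the cluster
        if len(members) < min_coverage:
            continue  # coverage below the threshold
        kept[cluster_id] = members
    return kept


clusters = {
    "c1": [("s1", "Ale"), ("s2", "Ale"), ("s3", "Ale")],
    "c2": [("s4", "Ale"), ("s5", "Angela"), ("s6", "Ale")],  # conflicting
    "c3": [("s7", "Ikeros"), ("s8", "Ikeros")],              # too small
}

print(sorted(filter_clusters(clusters)))
```

With the default threshold of 3, only cluster c1 survives in this toy input.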

GFF3 DANTE_LTR output specification

Types of features in GFF3:

Attributes of features in GFF3:
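
To see which feature types and attribute keys a particular output file contains, the GFF3 file can be summarized with a short script. The sketch below relies only on the general GFF3 layout (nine tab-separated columns, feature type in column 3, `key=value` attributes separated by `;` in column 9); the feature and attribute names in the sample are placeholders, not actual DANTE_LTR names.

```python
from collections import Counter


def parse_attributes(field):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def summarize_gff3(lines):
    """Count feature types (column 3) and attribute keys (column 9)."""
    types = Counter()
    keys = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a regular feature line
        types[cols[2]] += 1
        keys.update(parse_attributes(cols[8]).keys())
    return types, keys


# Placeholder records; real files use DANTE_LTR's own feature names.
sample = [
    "##gff-version 3\n",
    "chr1\texample\tfeature_a\t100\t900\t.\t+\t.\tID=a1;Note=demo\n",
    "chr1\texample\tfeature_b\t100\t250\t.\t+\t.\tParent=a1\n",
    "chr1\texample\tfeature_b\t750\t900\t.\t+\t.\tParent=a1\n",
]
types, keys = summarize_gff3(sample)
print(dict(types))
print(dict(keys))
```

For a real file, pass an open file handle instead of the sample list, for example `summarize_gff3(open("DANTE_LTR_annotation.gff3"))`.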

Modifying LTR-RT search constraints

It is possible to modify constraints for LTR search by providing a csv table with constraints for individual lineages.

The table has the following format:

| Lineage | Domains order | offset5prime | offset3prime | domain_span | ltr_length |
|---|---|---|---|---|---|
| Class_I/LTR/Ty1_copia/Ale | GAG PROT INT RT RH | 2000 | 2000 | 5700 | 123 |
| Class_I/LTR/Ty1_copia/Alesia | GAG PROT INT RT RH | 2000 | 3000 | 5400 | 273 |
| Class_I/LTR/Ty1_copia/Angela | GAG PROT INT RT RH | 6000 | 3000 | 5500 | 1074 |
| Class_I/LTR/Ty1_copia/Bianca | GAG PROT INT RT RH | 3500 | 3000 | 6000 | 132 |
| Class_I/LTR/Ty1_copia/Bryco | GAG PROT INT RT RH | 3000 | 3000 | 5000 | 287 |
| Class_I/LTR/Ty1_copia/Gymco-I | GAG PROT INT RT RH | 3500 | 2500 | 5400 | 151 |
| Class_I/LTR/Ty1_copia/Gymco-II | GAG PROT INT RT RH | 2000 | 6000 | 4600 | 156 |
| Class_I/LTR/Ty1_copia/Gymco-III | GAG PROT INT RT RH | 2000 | 2000 | 5400 | 247 |
| Class_I/LTR/Ty1_copia/Gymco-IV | GAG PROT INT RT RH | 2000 | 2000 | 5400 | 276 |
| Class_I/LTR/Ty1_copia/Ikeros | GAG PROT INT RT RH | 6500 | 3000 | 6100 | 359 |
| ... | ... | ... | ... | ... | ... |

Modify these constraints if you think that the default constraints lead to under-detection of elements whose structure deviates from the default constraints. Setting offset5prime, offset3prime or domain_span too high can however lead to the detection of aberrant or chimeric elements.

To use modified constraints, run dante_ltr with the --te_constrains option and provide the path to the modified csv table.

The full table with default constraints can be found in
databases/lineage_domain_order.csv.
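
A modified table can be produced by hand or with a short script. The Python sketch below loads a constraints table, changes one value for one lineage, and writes the result. The helper names are hypothetical, and the tab delimiter and exact header spelling are assumptions; check the shipped databases/lineage_domain_order.csv and adjust the `delimiter` argument if it differs.

```python
import csv
import io


def load_constraints(handle, delimiter="\t"):
    """Read the table; return (header, rows) with rows as lists of strings."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    return header, [row for row in reader if row]


def set_value(header, rows, lineage, column, value):
    """Set `column` for the row whose first field equals `lineage`."""
    col = header.index(column)
    for row in rows:
        if row[0] == lineage:
            row[col] = str(value)
            return True
    return False  # lineage not present in the table


def write_constraints(header, rows, handle, delimiter="\t"):
    """Write the table back in the same layout."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Two rows copied from the table above; ASSUMED to be tab-delimited.
sample = (
    "Lineage\tDomains order\toffset5prime\toffset3prime\tdomain_span\tltr_length\n"
    "Class_I/LTR/Ty1_copia/Ale\tGAG PROT INT RT RH\t2000\t2000\t5700\t123\n"
    "Class_I/LTR/Ty1_copia/Angela\tGAG PROT INT RT RH\t6000\t3000\t5500\t1074\n"
)
header, rows = load_constraints(io.StringIO(sample))
changed = set_value(header, rows, "Class_I/LTR/Ty1_copia/Ale", "domain_span", 6500)

out = io.StringIO()
write_constraints(header, rows, out)
modified = out.getvalue()
print(modified)
```

With real files, replace the `io.StringIO` objects with `open(...)` handles, save the result under a new name, and pass that path to dante_ltr via --te_constrains. The value 6500 is arbitrary and only demonstrates the edit.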