hyphaltip / mTEA

Mosquito TE Annotation
Artistic License 2.0

Classify element for TIR and TSD #2

Open hyphaltip opened 13 years ago

hyphaltip commented 13 years ago

Identify the TSD and TIRs of a putative element screened in the previous analysis step of the pipeline.

arensburger commented 13 years ago

Uploaded a first draft of a script to address this issue: id_TIR_in_FASTA.pl. The script takes a FASTA sequence as input and returns a GFF-like file with all the possible locations for TIRs in each sequence that are compatible with a specified set of constraints. It is purposefully designed to find lots of hits; these will be narrowed down later by comparing the possible TSD/TIRs from different branches of the tree.
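To make the "GFF-like" output concrete, here is a minimal Python sketch of how one candidate could be written as a tab-separated record. The column layout follows standard nine-column GFF and is an assumption; the actual columns written by id_TIR_in_FASTA.pl are not described here, and the `gff_like_line` name and the `id_TIR_sketch` source label are hypothetical.

```python
def gff_like_line(seq_id, feature, start, end, strand=".",
                  score=".", source="id_TIR_sketch", attributes="."):
    """Format one tab-separated, GFF-style record.

    Coordinates are assumed to be 1-based and inclusive, as in GFF.
    The `source` label is a made-up placeholder.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    columns = [seq_id, source, feature, str(start), str(end),
               str(score), strand, ".", attributes]
    return "\t".join(columns)


# One hypothetical TIR candidate on the plus strand of "contig1".
line = gff_like_line("contig1", "TIR", 101, 120, strand="+")
print(line)
```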

Here's the basic concept behind this script.

This script expects at least one FASTA file as input. The file is assumed 1) to be a section of a genome assembly, 2) to contain the sequence of a putative TE transposase, and 3) to include some sequence upstream and downstream of the transposase, where TIRs and TSDs will be searched for. Optionally, the start and end of the transposase sequence can be specified in the FASTA title; otherwise the script splits the sequence into two equal halves and looks for TIRs and TSDs in each half (this might be useful when dealing with MITEs later).
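The flank selection described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_flanks` and the 1-based, inclusive coordinate convention are assumptions, and how the coordinates are encoded in the FASTA title is left out because it is not specified here.

```python
def split_flanks(seq, tp_start=None, tp_end=None):
    """Return (upstream, downstream) regions to search for TIRs and TSDs.

    tp_start/tp_end are the transposase coordinates, assumed 1-based and
    inclusive. If either is missing, the sequence is cut into two halves.
    """
    if tp_start is None or tp_end is None:
        mid = len(seq) // 2
        return seq[:mid], seq[mid:]
    if not (1 <= tp_start <= tp_end <= len(seq)):
        raise ValueError("transposase coordinates fall outside the sequence")
    # Everything before the transposase, and everything after it.
    return seq[:tp_start - 1], seq[tp_end:]


# With coordinates: the transposase (CCCC) is excluded from both flanks.
print(split_flanks("AAAACCCCGGGG", 5, 8))
# Without coordinates: two equal halves.
print(split_flanks("AAAATTTT"))
```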

The basic workflow is: 1) do a local BLAST between the two sequences flanking the transposase to identify possible TIRs; 2) look at the sequences directly adjacent to the TIRs as possible TSDs, and if TSDs are allowed to include indels, generate sequences with all allowed combinations of insertions and deletions in the TSDs; 3) compare all observed and all possible TSDs and select those that are similar enough given the allowed number of substitutions in the TSD sequence; 4) write the positions in GFF-like format.
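As an illustration of step 1, here is a small Python sketch of what "possible TIRs" means: a stretch in the upstream flank that matches the reverse complement of a stretch in the downstream flank. Note that this swaps the local BLAST for exact k-mer matching purely to keep the example self-contained; it tolerates no mismatches or gaps. The function names and the seed length `k` are assumptions.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_tir_seeds(upstream, downstream, k=8):
    """Return (up_pos, down_pos) pairs, 0-based, where the k-mer starting at
    up_pos in the upstream flank equals the reverse complement of the k-mer
    starting at down_pos in the downstream flank."""
    upstream, downstream = upstream.upper(), downstream.upper()
    # Index every k-mer of the upstream flank by its start positions.
    index = {}
    for i in range(len(upstream) - k + 1):
        index.setdefault(upstream[i:i + k], []).append(i)
    hits = []
    for j in range(len(downstream) - k + 1):
        kmer_rc = reverse_complement(downstream[j:j + k])
        for i in index.get(kmer_rc, []):
            hits.append((i, j))
    return hits


# Toy flanks: CAGTGCTA upstream, its reverse complement downstream.
up = "TTT" + "CAGTGCTA" + "GG"
down = "AA" + reverse_complement("CAGTGCTA") + "CCC"
print(find_tir_seeds(up, down, k=8))
```

Each seed pair would then be a candidate TIR whose neighbouring bases are inspected for TSDs in step 2.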

The TSD part is not very elegant, but given how few bases a TSD contains, I don't see another way of dealing with indels than just brute force.
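For what it's worth, the brute-force idea is easy to sketch in Python: enumerate every variant of one TSD within the allowed number of single-base indels, then accept the pair if any variant is within the allowed number of substitutions of the other TSD. This is a simplified sketch, not the logic of id_TIR_in_FASTA.pl; the function names and the `max_indels`/`max_subs` parameters are assumptions.

```python
def indel_variants(tsd, max_indels=1, alphabet="ACGT"):
    """All strings reachable from tsd with at most max_indels
    single-base insertions or deletions (tsd itself included)."""
    variants = {tsd}
    frontier = {tsd}
    for _ in range(max_indels):
        new = set()
        for s in frontier:
            for i in range(len(s)):            # delete base i
                new.add(s[:i] + s[i + 1:])
            for i in range(len(s) + 1):        # insert a base before position i
                for base in alphabet:
                    new.add(s[:i] + base + s[i:])
        frontier = new - variants
        variants |= new
    return variants


def tsds_compatible(left, right, max_indels=1, max_subs=0):
    """True if some indel variant of `left` has the same length as `right`
    and differs from it by at most max_subs substitutions."""
    for variant in indel_variants(left, max_indels):
        if len(variant) != len(right):
            continue
        if sum(a != b for a, b in zip(variant, right)) <= max_subs:
            return True
    return False


print(tsds_compatible("TTAA", "TAA", max_indels=1, max_subs=0))
print(tsds_compatible("TTAA", "TTGA", max_indels=0, max_subs=1))
```

Because TSDs are only a few bases long, the variant set stays small even though it grows quickly with `max_indels`.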

The next step is to take the TIR locations from the different FASTA files and determine 1) which FASTA files have the same or similar TSDs and 2) which have dissimilar TSDs. FASTA files with similar TIRs and dissimilar TSDs should be scored as having a high probability of being active.
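A rough Python sketch of that comparison, assuming each FASTA file has already been reduced to one TIR sequence and one TSD sequence: pairs of files whose TIRs are nearly identical but whose TSDs differ are flagged. The mismatch count, both thresholds, and the file names are placeholders chosen for illustration, not settled choices.

```python
from itertools import combinations


def mismatches(a, b):
    """Position-wise differences, plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def likely_active_pairs(elements, max_tir_mismatch=1, min_tsd_mismatch=1):
    """elements maps a file name to a (tir, tsd) tuple. Returns the pairs of
    files with similar TIRs and dissimilar TSDs, sorted by file name."""
    pairs = []
    for (name1, (tir1, tsd1)), (name2, (tir2, tsd2)) in combinations(
            sorted(elements.items()), 2):
        similar_tir = mismatches(tir1, tir2) <= max_tir_mismatch
        different_tsd = mismatches(tsd1, tsd2) >= min_tsd_mismatch
        if similar_tir and different_tsd:
            pairs.append((name1, name2))
    return pairs


# Hypothetical input: three files share a TIR, two of them share a TSD.
elements = {
    "a.fa": ("CAGTGCTA", "TTAA"),
    "b.fa": ("CAGTGCTA", "GCAT"),
    "c.fa": ("CAGTGCTA", "TTAA"),
    "d.fa": ("GGGGGGGG", "ACGT"),
}
print(likely_active_pairs(elements))
```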