sestaton / tephra

A tool for discovering transposable elements and describing patterns of genome evolution
MIT License

LTR Age output #36

Open DiegoZavallo opened 5 years ago

DiegoZavallo commented 5 years ago

Hi Evan, I was wondering if you could explain the LTR age output to me. How is the age calculated for each LTR family? The summary file has these columns:

Superfamily  Family       Family_size  ElementID             Divergence  Age      Ts:Tv
Copia        RLC_family0  43           RLC_family0_exemplar  0.0471      2355000  3.2242

This particular family has 43 elements (Family_size), but there is only one exemplar element (ElementID). What does the Age value represent? Is it the age of that one element, or the average over all 43 elements in the family? How is it calculated?

Best

Diego

sestaton commented 5 years ago

The exemplar is the representative of the family, and by default the age command will try to calculate the age of only the exemplars for efficiency.

However, if you run the command with --all it will calculate the age of all the elements that were found.

The age value is in years, so in this case the element is estimated to have inserted about 2,355,000 years ago. Note that the age depends on the substitution rate, so you will get the most accurate results by finding the published rate for your species and passing it to the 'age' command with the -r|subs_rate option. The default value (1e-8) may not make sense for your species, so pay special attention to this option.
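As a rough illustration of how the age depends on the substitution rate: the values in this thread are consistent with the commonly used estimate T = d / (2r), where d is the divergence between the two LTRs and r is the substitution rate per site per year. This is an inference from the reported numbers, not a description of Tephra's internals, and the function name `insertion_age` is hypothetical.

```python
def insertion_age(divergence, subs_rate=1e-8):
    """Estimate years since insertion as T = d / (2r).

    Assumption: this mirrors the standard LTR dating formula; it is not
    taken from Tephra's source code.
    """
    if subs_rate <= 0:
        raise ValueError("subs_rate must be positive")
    return divergence / (2 * subs_rate)


# Divergence reported for RLC_family0_exemplar, using the default rate of 1e-8.
age = insertion_age(0.0471)
print(round(age))  # 2355000
```

The result matches the Age column shown for the exemplar, which suggests the default rate was used in that run.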

DiegoZavallo commented 5 years ago

Hi Evan, thanks for your reply. Could you point me to the file that contains the age of every element? I ran Tephra with --all.

Also, what do you mean by passing the value for my species? Can that option be combined with --all?

Thanks!

sestaton commented 5 years ago

Yes, there are a number of options to the age command, and you can see them by running the tephra age command with no arguments, like so:

$ tephra age

[ERROR]: The '--genome' argument was not given or the file does not exist. Check input.    

Name:
     tephra age - Calculate the age distribution of LTR or TIR transposons.

Description:
     This subcommand takes a GFF3 of LTR or TIR transposons as input from Tephra and calculates
     the insertion time and age of each element using a substitution rate and model of evolution.

  USAGE: tephra age [-h] [-m]
      -m --man      :   Get the manual entry for a command.
      -h --help     :   Print the command usage.

  Required:
      -g|genome     :   The genome sequences in FASTA format used to search for LTRs/TIRs.
      -f|gff        :   The GFF3 file of LTRs/TIRs in <--genome>.
      -o|outfile    :   The output file containing the age of each element.
      -i|indir      :   The input directory of superfamily exemplars.
      --type        :   Type of transposon to calculate age for (must be 'ltr' or 'tir').

  Options:
      -i|indir      :   The input directory of superfamily exemplars.
      -r|subs_rate  :   The nucleotide substitution rate to use (Default: 1e-8).
      -t|threads    :   The number of threads to use for clustering coding domains (Default: 1).
      -c|clean      :   Clean up all the intermediate files from PAML and clustalw (Default: yes).
      -a|all        :   Calculate age of all LTRs/TIRs in <gff> instead of exemplars in <indir>.

So, to change the substitution rate and calculate the age of all elements with 12 threads, you could do something like:

tephra age -f tephra_ltrs_classified.gff3 -g genome.fas -o tephra_ltrs_classified_ages.tsv -r 1e-10 -t 12 --type ltr --all

DiegoZavallo commented 5 years ago

Thanks, Evan! I'll run it as you suggest. My input file, however, is named tephra_ltrs_trims_classified.gff3 (it includes the TRIMs as well), but I assume it is the equivalent file.

DiegoZavallo commented 5 years ago

Hi Evan, I ran it as you suggested:

tephra age -f potato_dm_v404_all_pm_un_tephra_ltrs_trims_classified.gff3 -g potato_dm_v404_all_pm_un.fasta -o tehpra_ltrs_classified_ages.tsv -r 1e-9 -t 5 --type ltr --all

Indeed, every element now has its own age. However, I have a few questions:

Does the age file with only the exemplar ages include singletons? Are the singletons complete LTR elements that are the only member of their family? Should I include them? I'm asking because I now have more than 27,000 elements with their own age (including singletons, TRIMs and LARDs), whereas before there seemed to be only 10,400 elements (according to the file named potato_dm_v404_all_pm_un_tephra_ltrs_trims_classified_family-level_domain_org.tsv).

I also found something strange in the ages themselves. The first run had an age range of approximately 0 to 10 million years, while this run has a range of 0 to 100 million! Looking at the numbers, every value appears to have an extra 0. For instance:

First run: Copia RLC_family0 43 RLC_family0_exemplar 0.0471 2355000 3.2242

Second run: RLC_family0_LTR_retrotransposon11004_chr06_41085231_41090619 0.0471 23550000 3.2242

Could this be an error?

Best

Diego

sestaton commented 5 years ago

> Does the age file with only the exemplar ages include singletons?

No, the exemplars are from multi-copy families, so they do not include singletons.

> Are the singletons complete LTRs but with only one member in the family? Should I include them?

Yes, that's right. The singletons are single-copy elements not belonging to any family. I would include all the elements if possible because this will give you a broader view of the age distribution. This is not the default because it can be time-consuming to calculate for hundreds of thousands of elements.
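If you do calculate ages for all elements, one way to look at the distribution is to summarize the ages per family. The sketch below is only illustrative: it assumes a tab-delimited file with a header row that contains at least the `Family` and `Age` columns from the summary shown earlier in this thread, which may not match the layout of your `--all` output. The function name `summarize_ages` and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median


def summarize_ages(handle):
    """Group ages by family and report count, mean, median, min and max.

    Assumption: `handle` yields tab-delimited text with a header row
    that includes 'Family' and 'Age' columns.
    """
    ages = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        ages[row["Family"]].append(float(row["Age"]))
    return {
        family: {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for family, values in ages.items()
    }


# Hypothetical rows for illustration only (not real Tephra output).
example = io.StringIO(
    "Superfamily\tFamily\tFamily_size\tElementID\tDivergence\tAge\tTs:Tv\n"
    "Copia\tfamA\t2\tfamA_element1\t0.0100\t500000\t2.0\n"
    "Copia\tfamA\t2\tfamA_element2\t0.0300\t1500000\t2.0\n"
)
summary = summarize_ages(example)
print(summary["famA"])
```

To use it on a real file, open the file with `open(path, newline="")` and pass the handle to `summarize_ages`, after checking that the column names match.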

Concerning the difference in ages, I would expect some differences with a different sample of elements. That said, the variation you describe is puzzling, because I'm not sure these two lines refer to the same element. Could you share the age file, and also the classified LTR headers?

You can create that list like so:

grep ">" tephra_classified_ltrs.fasta | sed 's/>//' > tephra_classified_ltrs_headers.txt

You can share those files here or by email, whichever you prefer.

DiegoZavallo commented 5 years ago

Hi Evan, thanks for the reply. I think I know what was wrong. In the first run the substitution rate was the default (1e-8), and in the second one (running only tephra age) I set it to 1e-9. Can this difference be responsible for the extra 0?

sestaton commented 5 years ago

> Hi Evan, thanks for the reply. I think I know what was wrong. In the first run the substitution rate was the default (1e-8), and in the second one (running only tephra age) I set it to 1e-9. Can this difference be responsible for the extra 0?

Yes! That is the reason for the difference, so it's not an error.
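To see why a tenfold change in the rate adds exactly one zero, here is a small sketch that rescales an age from one substitution rate to another. It assumes the age is inversely proportional to the rate (as in the common estimate T = d / (2r)), which both values reported in this thread agree with; the function name `rescale_age` is hypothetical and not part of Tephra.

```python
def rescale_age(age, old_rate, new_rate):
    """Convert an age computed with old_rate into the age implied by new_rate.

    Assumption: age is inversely proportional to the substitution rate,
    so age_new = age_old * old_rate / new_rate.
    """
    if old_rate <= 0 or new_rate <= 0:
        raise ValueError("substitution rates must be positive")
    return age * old_rate / new_rate


# Exemplar age from the first run (rate 1e-8), rescaled to the second run's rate (1e-9).
rescaled = rescale_age(2355000, old_rate=1e-8, new_rate=1e-9)
print(round(rescaled))  # 23550000
```

The divergence (0.0471) is the same in both runs; only the rate used to convert it into years changed.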