sestaton / tephra

A tool for discovering transposable elements and describing patterns of genome evolution
MIT License

issue with gene file #48

Open jcerca opened 3 years ago

jcerca commented 3 years ago

Dear Evan,

I am really excited about getting Tephra running; it seems to be a beautiful piece of software. I had some issues I'd like to solve, though. I am posting them here so everyone can see them, but please let me know if this doesn't belong in the issues.

I got the Docker version running after installing Docker:

$ docker run -it --name tephra-con -v $(pwd)/db:/db:Z sestaton/tephra
$ cd /db
$ wget https://raw.githubusercontent.com/sestaton/tephra/master/config/tephra_config.yml
#### changed the "logfile", "genome", "outfile", "repeatdb" (using your sunflower library, thank you for that!).
$ tephra all -c tephra_config.yml

[ERROR]: gene file was not defined in configuration or does not exist. Check input. Exiting.

I noticed that the new config file has a "genefile" line. It is possibly new, since it is not in the manual or help pages.

I deleted it:

$ sed "s/.*genefile.*//; /^$/d" tephra_config.yml > tephra_config2.yml
$ tephra all -c tephra_config2.yml

[ERROR]: 'trnadb' under 'all' is not defined after parsing configuration file.
         This indicates there may be a blank line in your configuration file.
         Please check your configuration file and try again. Exiting.
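A quick way to see what the edited file lost is to check which keys from the example config's "all" section are still present. This is only a sketch: the key list is copied from the config posted in this issue, `check_all_keys` is a hypothetical helper that is not part of Tephra, and it does not reproduce how Tephra itself parses the file.

```shell
#!/bin/sh
# check_all_keys CONFIG
# Report which keys from the example config's "all" section are absent.
# Hypothetical helper: it greps the whole file for "- key:" lines and
# does not parse YAML or restrict the search to the "all" section.
check_all_keys() {
    config=$1
    missing=0
    for key in logfile genome outfile repeatdb genefile trnadb hmmdb \
               threads clean debug subs_rate; do
        if ! grep -Eq "^[[:space:]]*-[[:space:]]*${key}:" "$config"; then
            echo "missing key: ${key}"
            missing=1
        fi
    done
    # Exit status 0 when every key was found, 1 otherwise.
    return "$missing"
}
```

Running `check_all_keys tephra_config2.yml` on the file produced by the sed command above would be expected to print `missing key: genefile`.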

Q1: I interpret this as meaning that it did not like my reformatting of the config file. I was thus wondering what "TAIR10_genes.fas" is. Is it the gene annotation of Arabidopsis? I checked NCBI, and TAIR10 seems to be an assembly name for this species (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001735.4).

Q2: Is there a way to run the "all" command without specifying the annotations? See the config file below.

$ cat t*yml
## For more information about this file, see:
## https://github.com/sestaton/tephra/wiki/Specifications-and-example-usage.
all:
  - logfile:          tephra.log
  - genome:           scalesia_atractyloides.fasta
  - outfile:          scalesia_atractyloides_thra_transposons.gff3
  - repeatdb:         Ha412v1r1_transposons_v1.0.fasta
  - genefile:         TAIR10_genes.fas
  - trnadb:           TephraDB
  - hmmdb:            TephraDB
  - threads:          24
  - clean:            YES
  - debug:            NO
  - subs_rate:        1e-8
findltrs:
  - dedup:             NO
  - tnpfilter:         NO
  - domains_required:  NO
  - ltrharvest:
     - mintsd:         4
     - maxtsd:         20
     - minlenltr:      100
     - maxlenltr:      1000
     - mindistltr:     1000
     - maxdistltr:     15000
     - seedlength:     30
     - tsdradius:      60
     - xdrop:          5
     - swmat:          2
     - swmis:          -2
     - swins:          -3
     - swdel:          -3
     - overlaps:       best
  - ltrdigest:
     - pptradius:      30
     - pptlen:         8 30
     - pptagpr:        0.25
     - uboxlen:        3 30
     - uboxutpr:       0.91
     - pbsradius:      30
     - pbslen:         11 30
     - pbsoffset:      0 5
     - pbstrnaoffset:  0 5
     - pbsmaxeditdist: 1
     - pdomevalue:     1E-6
     - pdomcutoff:     NONE
     - maxgaplen:      50
classifyltrs:
  - percentcov:       50
  - percentid:        80
  - hitlen:           80
illrecomb:
  - repeat_pid:       10
ltrage:
  - all:              NO
maskref:
  - percentid:        80
  - hitlength:        70
  - splitsize:        5000000
  - overlap:          100
sololtr:
  - percentid:        39
  - percentcov:       80
  - matchlen:         80
  - numfamilies:      20
  - allfamilies:      NO
tirage:
  - all:              NO

sestaton commented 3 years ago

Hi Jose,

Thank you for the comments, and I'm sorry for the slow response. I have been busy with a new job and I was doing a long-distance move last week. Now, to the issue: you are correct that the gene file entry in the config file is new. It was added to remove spurious transposon predictions (mainly TIR elements) that are actually tandem gene duplicates.

There is no way to remove the entry at this time and still run the tephra all command. I don't think I want to add this option because, based on my research, it would just mean including spurious predictions.

The Arabidopsis gene file was an example for use with Arabidopsis specifically. My advice would be to use a set of gene predictions from your species, or a closely related species. This needs to be documented and the rationale needs to be explained because right now I think there is no mention at all. Sorry! Thank you for mentioning the issue here.
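As a rough sketch of that advice, the genefile entry can be pointed at a different gene set by rewriting only its value and leaving the line itself in place. The helper name `set_genefile` and the file name `my_species_genes.fasta` are hypothetical placeholders, and the sed pattern assumes the `- genefile:` layout of the example config.

```shell
#!/bin/sh
# set_genefile CONFIG GENES
# Print CONFIG with the value of the "- genefile:" entry replaced by GENES.
# Hypothetical helper; GENES must not contain "|" or "&", which are
# special in the sed expression used here.
set_genefile() {
    config=$1
    genes=$2
    # Keep the leading "  - genefile:   " part (group 1), swap the value.
    sed -E "s|^([[:space:]]*-[[:space:]]*genefile:[[:space:]]*).*|\1${genes}|" "$config"
}
```

For example, `set_genefile tephra_config.yml my_species_genes.fasta > tephra_config_genes.yml` writes an updated copy, which can then be passed to `tephra all -c`.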

Please let me know if that helps and if you need advice on the files.

Thanks, Evan

sestaton commented 3 years ago

The wiki page referenced in the configuration file has been updated, at least. A thorough demonstration of the usage is still needed, but this is a small step.

jcerca commented 3 years ago

Hi Evan,

Thank you for your answer and for your time. I'll try to do this as soon as I have some time, possibly next week.

José