Question about how to run Tephra correctly?

CSU-KangHu commented 1 year ago

Hi,

I'm trying to use Tephra to generate the TE library for Caenorhabditis briggsae (GCF_000004555.2_CB4_genomic.fna). I downloaded the Docker image from Docker Hub and successfully ran the container. I've read through the Specifications and example usage and have only changed the genome, outfile, repeatdb, genefile, and threads parameters in the all section, leaving the rest as default.

I have three main questions, and I'd appreciate your assistance:

1. Since my goal is to generate the TE library specific to the species, I don't understand why repeatdb is needed (Is it a database like RepBase?).
2. According to your description, genefile is used to filter false positives in TIR elements, but I'm unsure how to download the gene file for a specific species. These additional inputs seem a bit demanding for a regular user.
3. I provided two empty files, fake_repeat.fa and fake_gene.fa, in the repeatdb and genefile parameters. After running Tephra, the program seems to be stuck after running for several hours.

I have a small suggestion for Tephra. I think only the genome parameter should be mandatory, while the rest can be optional. The program should be able to run even if the user doesn't provide them.

CSU-KangHu commented 1 year ago

I've downloaded the gene file (GCF_000004555.2_CB4_rna.fna) for Caenorhabditis briggsae from NCBI and provided the repeat database (cbrrep.ref) sourced from Repbase. However, even after running for some time, the program remains stuck at the message 'Command - 'tephra illrecomb' started at: 21-09-2023 12:12:25'.

sestaton commented 1 year ago

Hi @CSU-KangHu,

> Since my goal is to generate the TE library specific to the species, I don't understand why repeatdb is needed (Is it a database like RepBase?).

The repeat database is used for annotating unclassified elements. Transposons are classified into superfamilies according to 1) structure, 2) the presence of coding domains, and 3) the order of those domains. If none of this information is present, it is difficult to classify the TE other than by a similarity search against a database of TEs. If this search fails to find significant matches (using published thresholds for matches), the elements are marked as unclassified according to order; otherwise, they are annotated at the superfamily level of the significant matches. The database is not used for family-level classification.
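The fallback described above can be sketched roughly as follows. This is an illustration only, not Tephra's actual code: the `DbHit` type, the `classify` function, the label format, and the 80% identity/coverage defaults are all hypothetical placeholders standing in for whatever published thresholds are really applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DbHit:
    """Best match of an element against the repeat database (hypothetical type)."""
    superfamily: str
    identity: float  # percent identity of the match
    coverage: float  # percent of the element covered by the match


def classify(order: str,
             structural_superfamily: Optional[str],
             best_hit: Optional[DbHit],
             min_identity: float = 80.0,   # placeholder threshold, an assumption
             min_coverage: float = 80.0) -> str:  # placeholder threshold, an assumption
    """Return a superfamily label, or an order-level 'unclassified' label."""
    # 1) Structure and coding domains were enough to assign a superfamily.
    if structural_superfamily is not None:
        return structural_superfamily
    # 2) Otherwise fall back to a significant match in the repeat database.
    if (best_hit is not None
            and best_hit.identity >= min_identity
            and best_hit.coverage >= min_coverage):
        return best_hit.superfamily
    # 3) No usable information: keep the element, labelled only by order.
    return f"unclassified_{order}"
```

In this sketch the database only ever contributes a superfamily name, which matches the point that it is not used for family-level classification.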

> According to your description, genefile is used to filter false positives in TIR elements, but I'm unsure how to download the gene file for a specific species. These additional inputs seem a bit demanding for a regular user.

This is a good suggestion. I will think about making this optional. The idea is to provide as much information as possible in the annotation process so you end up with a high-confidence set of annotations rather than a large number of predictions that may lead to conflicts at later stages of the analysis. This is based on my experience, but I can provide more details on this point and on where to get the files that are needed.

> I provided two empty files, fake_repeat.fa and fake_gene.fa, in the repeatdb and genefile parameters. After running Tephra, the program seems to be stuck after running for several hours.

I see in the log you provided that it is receiving the ABRT signal, so something has killed the tephra process that was running, or some of the sub-processes are being killed. You might want to check the memory usage and whether you are reaching limits (CPU, RAM, or disk space) on your computer, or in the queueing system if it is a computing cluster. I don't believe this is related to the input, but you will definitely want to provide non-empty files. There are some validation steps on the input, but I will add checks to make sure the inputs are not empty in the next version.
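Until such checks exist, a small pre-flight script along these lines could catch empty inputs and low disk space before a long run. It is a sketch under stated assumptions, not part of Tephra: the function names, the `inputs` mapping, and the 10 GB free-space default are hypothetical and should be adjusted to the genome size.

```python
import os
import shutil


def count_fasta_records(path):
    """Count lines starting with '>' (FASTA headers) in a plain-text file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def preflight(inputs, workdir, min_free_gb=10.0):
    """Return a list of problems found; an empty list means the checks passed.

    inputs      -- dict mapping a label (e.g. 'genome') to a file path
    workdir     -- directory where output will be written
    min_free_gb -- hypothetical minimum free space, not a Tephra requirement
    """
    problems = []
    for label, path in inputs.items():
        if not os.path.isfile(path):
            problems.append(f"{label}: file not found: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"{label}: file is empty: {path}")
        elif count_fasta_records(path) == 0:
            problems.append(f"{label}: no FASTA records in: {path}")

    # Free space on the filesystem that holds the working directory.
    free_gb = shutil.disk_usage(workdir).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free in {workdir}")
    return problems
```

Calling `preflight({"genome": ..., "repeatdb": ..., "genefile": ...}, workdir)` and printing the returned list before launching would have flagged the two empty placeholder files right away. Memory and CPU limits are not covered here and are better checked with the system's own monitoring tools.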

CSU-KangHu commented 1 year ago

Hi @sestaton, my machine supports a maximum of 40 threads. I noticed that when I set a large thread number, for example 35, Tephra tends to get stuck. Therefore, I reduced the thread number to 20. Tephra then ran successfully on Caenorhabditis briggsae (GCF_000004555.2_CB4_genomic.fna) and produced the output cb_tephra_transposons_complete.fasta.

Subsequently, I attempted to run Tephra on the rice genome (GCF_001433935.1_IRGSP-1.0_genomic.fna) with the same configuration. However, after running for nearly 8 hours, Tephra encountered an error. May I ask what could be the reason for this?

sestaton commented 1 year ago

Thank you for the feedback, @CSU-KangHu. It is hard to tell exactly what is happening other than that no TIRs were found, which is unexpected for rice. Perhaps the data could not be written to disk? Checking that you have ample disk space would be a good place to start.

Also, if you could point me to the input files you are using I will try to recreate the issue to better understand what is happening.

CSU-KangHu commented 1 year ago

Thank you for your assistance. The program's input and output data are stored in the /home/huk/tephra_lib/rice directory, and there is still sufficient space.

Download links for input data:

Genome file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/433/935/GCF_001433935.1_IRGSP-1.0/GCF_001433935.1_IRGSP-1.0_genomic.fna.gz
Gene file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/433/935/GCF_001433935.1_IRGSP-1.0/GCF_001433935.1_IRGSP-1.0_rna.fna.gz
Repbase file: needs to be downloaded in FASTA format from https://www.girinst.org/server/RepBase/index.php; I used oryrep.ref.
Tephra configuration file: only the 'all' section was modified.
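For anyone reproducing this, the two NCBI files can be fetched and decompressed with a short script such as the sketch below. The helper names `gunzip` and `fetch_and_gunzip` are made up for illustration, and it assumes the decompressed .fna files are what the configuration should point to, as in the earlier runs. The Repbase file is not covered because it is downloaded from the site separately.

```python
import gzip
import shutil
import urllib.request


def gunzip(src, dest):
    """Decompress a .gz file to dest without loading it all into memory."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def fetch_and_gunzip(url, dest):
    """Download url to dest + '.gz', then decompress it to dest."""
    gz_path = dest + ".gz"
    urllib.request.urlretrieve(url, gz_path)
    return gunzip(gz_path, dest)


# Example call (not executed here), using the genome link listed above:
# base = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/433/935/GCF_001433935.1_IRGSP-1.0/"
# fetch_and_gunzip(base + "GCF_001433935.1_IRGSP-1.0_genomic.fna.gz",
#                  "GCF_001433935.1_IRGSP-1.0_genomic.fna")
```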

CSU-KangHu commented 1 year ago

By the way, when I run the Arabidopsis thaliana genome with the same configuration (20 threads), the same issue occurs as with Caenorhabditis briggsae. The program still gets stuck at the 'tephra illrecomb' step.

Both the Arabidopsis thaliana and Caenorhabditis briggsae genome and gene files were downloaded from NCBI.

sestaton / tephra

Question about how to run Tephra correctly? #51