qjiangzhao / TEtrimmer

TEtrimmer: a novel tool to automate manual curation of transposable elements
GNU General Public License v3.0

Some questions about the use of result files. #37

Closed abcyulongwang closed 1 month ago

abcyulongwang commented 1 month ago

Dear Jiangzhao,

Thank you and your team for developing this excellent software; it is clearly very useful and relieves researchers of the huge workload of manual curation.

I successfully ran TEtrimmer and all of the output files look correct. My original input TE.fa contained 579 transposon sequences; for testing, I simply fed in all TE types, including the Unknown type. My questions are:

  1. Is the final TEtrimmer_consensus_merged.fasta file usable? It retained only 254 TE sequences, so almost half of the input was filtered out. What is the retention threshold? Are only Annotations_good/ and Annotations_perfect/ retained, or something else?

  2. I want to integrate multiple tools to predict transposons in the genome, because different tools have different prediction preferences and detect the various TE types with different efficiency. For example, I curate the RepeatModeler predictions with TEtrimmer and then want to use the result as a supplementary curated library for EDTA --curatedlib. EDTA requires that the supplementary library be 100% reliable, so should I use Annotations_perfect/, Annotations_good/, or TEtrimmer_consensus_merged.fasta as input?

  3. Regarding the original file containing Unknown sequences: do I need to extract the Unknown sequences and run them separately with --classify_unknown? And is it necessary to use --classify_all to classify the other, already known TEs?

Looking forward to your reply, it will be very helpful to me!

Best wishes, Yulong

qjiangzhao commented 1 month ago

Hi Yulong,

Thanks for your interest in TEtrimmer.

Answer 1: TEtrimmer_consensus_merged.fasta can be used directly for genome-wide TE annotation. The main reason half of your sequences were filtered out is that the corresponding input TE consensus sequences do not have enough copies in the genome. All skipped elements are stored in the <your_output_path>/TEtrimmer_for_proof_curation/TE_skipped folder; you can use TEtrimmerGUI to check why they were skipped and decide whether to rescue them. "Annotation_perfect", "Annotation_good", "Annotation_check_recommended", and "Annotation_check_required" are all included in TEtrimmer_consensus_merged.fasta. "check_recommended" or "check_required" does not mean those sequences are useless; TEtrimmer provides this evaluation system to help with proof curation.
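
If it helps, here is a minimal sketch for inspecting what was kept versus skipped (standard shell commands only; adjust <your_output_path> and the location of the merged fasta to your actual output layout):

```bash
# Count the consensus sequences retained in the merged library
grep -c ">" <your_output_path>/TEtrimmer_consensus_merged.fasta

# List the skipped elements, which you can then open in TEtrimmerGUI
ls <your_output_path>/TEtrimmer_for_proof_curation/TE_skipped/
```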

Answer 2: I do not recommend using TEtrimmer_consensus_merged.fasta for EDTA --curatedlib directly. It is highly recommended to do proof curation with TEtrimmerGUI first, which gives you a manual-curation-level TE consensus library.

Another way to combine the output of different TE annotation tools is to merge their consensus libraries into one file and run TEtrimmer on it with the "--dedup" option, as sketched below.
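
A minimal sketch of that workflow (the library file names are placeholders, and the option names other than --dedup are assumptions; please check TEtrimmer --help for the exact flags in your version):

```bash
# Merge the consensus libraries produced by different tools into one file
cat repeatmodeler_lib.fa edta_lib.fa other_tool_lib.fa > combined_TE_library.fa

# Run TEtrimmer on the combined library; --dedup removes duplicated entries
# (input/genome/output option names are assumptions; see TEtrimmer --help)
TEtrimmer --input_file combined_TE_library.fa \
          --genome_file genome.fa \
          --output_dir TEtrimmer_combined_out \
          --dedup
```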

Answer 3: You do not need to split your input file. Simply add the "--classify_all" option and TEtrimmer will classify all sequences.
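
For example, assuming your input keeps RepeatModeler-style headers such as ">rnd-1_family-1#Unknown" (an assumption about your file, not a TEtrimmer requirement), you can count how many entries are still unclassified before and after the run:

```bash
# Unclassified entries in the original input library
grep -c "#Unknown" TE.fa

# Unclassified entries remaining after running TEtrimmer with --classify_all
grep -c "#Unknown" <your_output_path>/TEtrimmer_consensus_merged.fasta
```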

Yours sincerely, Jiangzhao

abcyulongwang commented 1 month ago

Thanks for your reply. I merged the results of three tools and ran TEtrimmer. It was very fast at the beginning but became slower and slower later on; for example, step 6455/6461 can take a very long time. Is there any strategy to speed it up?

Best wishes, Yulong

qjiangzhao commented 1 month ago

The time-consuming sequences accumulate towards the end of the analysis, so it is normal for it to become slower. I recommend waiting a little longer.