How to refine the panTElib library?

porkfan commented 3 months ago

Hello, Dr. Jiangzhao!

I used Professor Ou's latest updated EDTA2 and panEDTA pipelines to generate a panTElib library based on my multiple genomes. However, it is evident that there are still many redundant and incomplete sequences in it. Upon recommendations from several TE software developers, I have learned about your recent work, TEtrimmer. I hope to use this software to refine my constructed panTElib library. Your software requires an input file of a TE fasta library and the corresponding genome fasta sequence file. However, I used multiple sequence clusters to remove redundancies to obtain the panTElib library. Therefore, I am unsure if I should run TEtrimmer on the library annotated by EDTA2 for each sequence and their fasta sequences individually, and then run panEDTA, or if I can directly run TEtrimmer on the panTElib library that I have already constructed. Additionally, if this is feasible, do I need to select only one reference genome for input to refine, or do I need to input and refine each genome?

Best Wishes! Yifan Chen

qjiangzhao commented 3 months ago

Hi Yifan,

Thanks for your question. You can use the panTElib library as input for TEtrimmer. For the reference genome, however, you need to select only one.

Because of this one-genome limitation, I recommend checking the skipped elements with TEtrimmerGUI against your different genomes.

Yours sincerely Jiangzhao

abcyulongwang commented 3 months ago

@qjiangzhao

Dear Jiangzhao, if we compare the TE prediction results of each genome against its own genome, will the polymorphic transposons obtained be more accurate? However, there are a lot of skipped TEs. If we manually check each skipped TE, the workload is enormous, so we can only check Annotations_check_recommended and Annotations_check_required. If there is a better screening method, it would be worth a try. I have tested TEtrimmerGUI, and it works very well, showing the information for each TE in detail. However, I am still a novice in manual curation. Is there a specific guide to using this software, including how to select and discard TEs based on the visualization results and how to remove different TE boundaries? TEtrimmer has also been recommended by many people. Please forgive me for all these questions. I hope to get your reply.

Yours sincerely Yulong

qjiangzhao commented 3 months ago

Hi Yulong,

Thanks for your question. Yes, you are right: it is better to check each TE consensus library from the different genomes rather than the panTElib. There must be some overlap among the TE consensus libraries from your genomes, so I recommend you apply TEtrimmer to one of them first. Afterwards, you can use TEtrimmerGUI to get a high-quality TE consensus library.

Subsequently, you can use the high-quality TE consensus library you got from the first genome to eliminate the identical TE consensus sequences from the TE consensus library derived from the second genome (you can probably do that with cd-hit-est; I will consider introducing an option to take care of this in the next version). Then, you can run TEtrimmer for the second genome based on the filtered TE consensus library. If you have more genomes, you can repeat this procedure.
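
A minimal sketch of the filtering step described above, assuming CD-HIT is installed. Its cd-hit-est-2d program compares two nucleotide sets and writes the sequences from the second set that have no match in the first at or above the identity threshold. The helper names, the file names, and the 0.9 threshold are illustrative assumptions and are not part of TEtrimmer.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (useful for
# checking the call before starting a long job).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical helper (not part of TEtrimmer or CD-HIT).
# Usage: filter_against_curated <curated.fa> <candidate.fa> <out.fa> [identity]
filter_against_curated() {
    # -i   curated library from the first genome (reference set)
    # -i2  raw library from the second genome (set to be filtered)
    # -o   sequences from -i2 with no match in -i at or above the threshold
    # -c   identity threshold; 0.9 is an assumed value, tune as needed
    # -n   word size; CD-HIT suggests 8 or 9 for thresholds of 0.90-0.95
    run cd-hit-est-2d -i "$1" -i2 "$2" -o "$3" -c "${4:-0.9}" -n 8
}
```

For example, setting DRY_RUN=1 and calling "filter_against_curated curated_genome_1.fa lib_genome_2.fa lib_genome_2.filtered.fa" prints the command without running it. By default cd-hit-est-2d appears to match a candidate only against reference sequences that are at least as long as the candidate (the -s2 option relaxes this), so it is worth checking the CD-HIT user guide before relying on the output.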

As for the utilization of TEtrimmerGUI, you can refer to this issue #39.

To become more familiar with the manual curation of TEs, you can read this paper "A beginner’s guide to manual curation of transposable elements", which can also help you to understand how to use TEtrimmerGUI.

Yours sincerely Jiangzhao

abcyulongwang commented 3 months ago

Dear Jiangzhao

Thank you for your accurate and timely reply. All your suggestions are very helpful, but I still have some questions about how to get a high-quality TE consensus library for multiple genomes.

As you said, I would combine the TEtrimmer results of the previous genome with the software prediction results of the next genome, use cd-hit-est to remove redundancy, and get TE_combined.2genome.dedup.fa. Then I would use TE_combined.2genome.dedup.fa to run TEtrimmer on the second genome. This approach is sound: it ensures that the TE consensus library used for each genome is high-quality and non-redundant, which improves the accuracy and reliability of TE prediction.

I once ran TEtrimmer on a 2.5 Gb genome. The original combine.TE.fa_cd_hit.fa was 13 MB, and the final TEtrimmer_consensus_merged.fasta was 4 MB. The run took 6 days, and the running memory in R was close to 800 GB. So my question is: if I follow this iterative strategy with a large number of genomes (for example, I have 30), the input transposon file will become very large after 30 iterations. Will this consume a lot of running time and memory?

Another strategy is to run TEtrimmer on each genome separately, generate a TE consensus library for each, and then merge all the TE consensus libraries and run cd-hit-est to get the final result. This strategy sacrifices some annotation accuracy but improves the calculation speed. I don't know whether my concerns are warranted or which method to use to complete this huge workload. I hope to get your professional advice.

Best wishes Yulong

qjiangzhao commented 3 months ago

Dear Yulong,

If you want to have a high quality TE consensus library (manual curation level) from your genomes, you can follow those steps:

Apply TEtrimmer to one of your genomes (let's say genome_1) based on its own TE consensus library (TE_cons_lib_genome_1).
Use TEtrimmerGUI to check and improve the outputs for genome_1 and get a high-quality TE consensus library for genome_1 (TE_cons_lib_TEtrimmer_genome_1).
Move to genome_2.
Remove a TE consensus sequence from "TE_cons_lib_genome_2" if it shares over 90% identity with a sequence in "TE_cons_lib_TEtrimmer_genome_1". You can do this with RepeatMasker, for example: RepeatMasker TE_cons_lib_genome_2 -lib TE_cons_lib_TEtrimmer_genome_1 -nolow -pa 10 -s -dir <your_output_dir>
This gives you a ".out" file.
Run the following code to get the IDs of the sequences that share more than 90% identity with "TE_cons_lib_TEtrimmer_genome_1" (the sequences to remove):

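# Input: file_work_with.txt is the RepeatMasker ".out" table (assumed here to have no header lines and no parentheses around the "(left)" values).
# Complement-strand hits: repeat name, class, repeat length, aligned repeat length, query name, query length, aligned query length, divergence.
cat file_work_with.txt | awk '$9=="C" {print $0}' | awk -v OFS="\t" '{print $10, $11, $12+$13, $13-$14, $5, $7+$8, $7-$6, $2}' > file2.c.txt

# Convert the aligned lengths into fractions of the repeat ($4) and of the query ($7).
cat file2.c.txt | awk -v OFS="\t" '{print $1, $2, $3, $4/$3, $5, $6, $7/$6, $8}' > file3.c.txt

# Same two steps for forward-strand hits, where the repeat coordinates come in a different column order.
cat file_work_with.txt | awk '$9=="+" {print $0}' | awk -v OFS="\t" '{print $10, $11, $14+$13, $13-$12, $5, $7+$8, $7-$6, $2}' > file2.f.txt

cat file2.f.txt | awk -v OFS="\t" '{print $1, $2, $3, $4/$3, $5, $6, $7/$6, $8}' > file3.f.txt

# Merge both strands and keep hits that cover more than 10% of the query.
cat file3.f.txt file3.c.txt | awk '$7>0.1 {print $0}' > file_final.0.1.txt

# Keep the query IDs with more than 90% coverage and less than 10% divergence.
cat file_final.0.1.txt | awk '{if($7>0.9 && $8<10) {print $0}}' | cut -f 5 | sort -u > TE_cons_lib_genome_2_ID.txt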

Based on "TE_cons_lib_genome_2_ID.txt", you can identify the TE consensus sequences in "TE_cons_lib_genome_2" that share more than 90% identity with any sequence from "TE_cons_lib_TEtrimmer_genome_1". Extract the converse IDs (those not in the list) and the corresponding sequences.

Run TEtrimmer for genome_2, using the filtered sequences from the above code as input.
Repeat this for the other genomes.
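
Two steps in the procedure above are not spelled out in this thread: how file_work_with.txt is produced from the RepeatMasker ".out" file, and how the converse sequences are extracted. The following is a minimal sketch of both, under the assumption that file_work_with.txt is simply the ".out" table without its header lines and without the parentheses around the "(left)" values (awk would otherwise read those values as 0). Both function names are hypothetical.

```shell
#!/bin/sh
# ASSUMPTION: file_work_with.txt is the RepeatMasker ".out" table without its
# header lines and without the parentheses around the "(left)" values.

# prep_rm_out <file.out>: keep alignment rows (first field is the numeric
# SW score) and strip parentheses from columns 8, 12 and 14.
prep_rm_out() {
    awk '$1 ~ /^[0-9]+$/ {
        gsub(/[()]/, "", $8)
        gsub(/[()]/, "", $12)
        gsub(/[()]/, "", $14)
        print
    }' "$1"
}

# extract_unmatched <ids.txt> <library.fa>: print the FASTA records whose ID
# (header up to the first whitespace, without ">") is NOT listed in ids.txt.
extract_unmatched() {
    awk -v idfile="$1" '
        BEGIN { while ((getline line < idfile) > 0) drop[line] = 1 }
        /^>/  { id = substr($1, 2); keep = !(id in drop) }
        keep  { print }
    ' "$2"
}
```

With these helpers, the chain would be: "prep_rm_out <your_output_dir>/TE_cons_lib_genome_2.out > file_work_with.txt", then the awk commands above, then "extract_unmatched TE_cons_lib_genome_2_ID.txt TE_cons_lib_genome_2 > TE_cons_lib_genome_2.filtered.fa" (the last file name is made up for illustration). Reading the ID list in a BEGIN block keeps the extraction correct even when the ID list is empty.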

I will update this as a new option for the next version. If you can't follow the steps, please wait for the update.

You can clone the new version of TEtrimmer, which should consume less RAM. By the way, the RAM consumption correlates with the thread number you used for TEtrimmer.

Yours sincerely Jiangzhao

abcyulongwang commented 3 months ago

Thank you for your advice, I will continue to try it. Best wishes Yulong

qjiangzhao commented 3 months ago

No worries! Good luck. And thanks for using TEtrimmer and for your valuable questions.

qjiangzhao commented 4 weeks ago

Dear Yifan and Yulong,

I added a new option --curatedlib, which can be used for your pangenome analysis.
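
A minimal sketch of how the new option might fit into the iterative workflow discussed above. It assumes that --curatedlib takes the path to the already curated consensus library (FASTA) and that the other flags are --input_file, --genome_file, and --output_dir; "TEtrimmer --help" is the authority on the exact names. The wrapper functions and all file names are hypothetical.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical wrapper: run TEtrimmer on the next genome while passing the
# library curated from the previous genome(s) via --curatedlib.
# Usage: tetrimmer_with_curated <te_lib.fa> <genome.fa> <out_dir> <curated.fa>
tetrimmer_with_curated() {
    run TEtrimmer --input_file "$1" --genome_file "$2" \
        --output_dir "$3" --curatedlib "$4"
}
```

Setting DRY_RUN=1 and calling "tetrimmer_with_curated TE_cons_lib_genome_2 genome_2.fa out_genome_2 TE_cons_lib_TEtrimmer_genome_1" prints the command so that it can be reviewed before a multi-day run is started.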

Yours sincerely Jiangzhao

qjiangzhao / TEtrimmer

How to refine the panTElib library? #38