Manually curating low copy number TEs

lanasushko commented 3 months ago

Hi!

I would like to use TEtrimmer to curate consensus sequences from the PANTERA pipeline, which identifies TEs from pangenome polymorphisms. Because of this, many of the TE families it identified in my data are low copy number. I tried running TEtrimmer with one of the reference genomes, but most of the TEs were skipped due to a low hit number.

Is it possible to change the copy number requirement for TEtrimmer runs? The messages in the log file said that "check_low_copy is False". Can I set it to True? I couldn't find any option like that in the manual.

My other question is about a possible workflow for curating this kind of TE. Since the TEs in my library come from different genomes, using only one of them as the reference in TEtrimmer introduces a bias. I am planning to first check how many copies of each TE family are present in each genome and then select the most suitable genome for manual curation of each TE. That way I am at least curating against the genome with the highest number of good-quality hits, so that I can generate a good consensus in the end. Even so, sometimes the selected genome does not have enough copies to pass TEtrimmer's low copy number filter. Do you think this could be a good workflow?
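To make the genome-selection step concrete, here is a minimal sketch in Python. It is not part of TEtrimmer or PANTERA: the function name `best_genome_per_family`, the input tuple layout, and the 0.8 coverage cutoff used to call a hit "good quality" are all assumptions for illustration. The hit records could come, for example, from parsing tabular BLAST output for each genome.

```python
from collections import defaultdict


def best_genome_per_family(hits, min_cov=0.8):
    """Pick, for each TE family, the genome with the most good-quality hits.

    hits: iterable of (family, genome, aligned_len, consensus_len) tuples.
    min_cov: hypothetical cutoff; a hit counts as "good" when it covers
             at least this fraction of the consensus.
    Returns {family: (genome, n_good_hits)}. Families without any good
    hit are absent from the result.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for family, genome, aligned_len, consensus_len in hits:
        if aligned_len / consensus_len >= min_cov:
            counts[family][genome] += 1

    best = {}
    for family, per_genome in counts.items():
        # Iterate genomes in sorted order so ties resolve alphabetically.
        genome = max(sorted(per_genome), key=per_genome.get)
        best[family] = (genome, per_genome[genome])
    return best


example_hits = [
    ("TE1", "ref", 950, 1000),
    ("TE1", "ref", 300, 1000),      # too short, ignored
    ("TE1", "angeb2", 990, 1000),
    ("TE1", "angeb2", 900, 1000),
    ("TE2", "ref", 500, 500),
]
print(best_genome_per_family(example_hits))
```

The resulting per-family counts also show up front which families would still fall below a copy number filter even in their best genome.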

I also have another issue, which is probably related to the TEs being low copy number as well. Many consensus sequences that TEtrimmer processed in my first run with one reference genome lost their LTR or TIR sequences. Is this due to coverage thresholds within the automatic pipeline? Can I do something to prevent it?

example 1 LTR-RT

example 2 LTR-RT

example 3 DNA/MULE

Also, why are a lot of TEs trimmed even though there are full-length copies present in the genome? example:

Thanks for developing TEtrimmer! Best,

Lana

qjiangzhao commented 3 months ago

Dear Lana:

TEtrimmer mainly relies on the multi-copy nature of TEs to identify their boundaries. If the number of BLAST hits for an input consensus sequence is lower than 10, TEtrimmer checks whether the sequence contains terminal repeats (LTR or TIR) and whether the number of full-length BLAST hits is greater than 2. If so, the input consensus sequence is treated as a low copy element. Otherwise, you will see "check_low_copy is False" and the corresponding consensus sequence is skipped. For this reason, you can't set "check_low_copy" to True; it is decided automatically. You can use TEtrimmerGUI to conveniently check whether the skipped elements are real TEs.
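The rule described above can be summarised as a small Python sketch. This is only an illustration of the logic as stated in this comment, not TEtrimmer's actual code: the function name `classify_consensus`, its parameter names, and the returned labels are hypothetical, while the thresholds (10 hits, more than 2 full-length hits) are the ones given in the comment.

```python
def classify_consensus(n_blast_hits, has_terminal_repeat, n_full_length_hits,
                       min_hits=10, min_full_length=2):
    """Illustrative version of the low-copy decision (names are hypothetical).

    Returns "multi_copy", "low_copy", or "skipped".
    """
    # Enough BLAST hits: the regular multi-copy analysis applies.
    if n_blast_hits >= min_hits:
        return "multi_copy"
    # Fewer than min_hits: accept only elements with terminal repeats
    # (LTR or TIR) and more than min_full_length full-length hits.
    if has_terminal_repeat and n_full_length_hits > min_full_length:
        return "low_copy"
    # Corresponds to the "check_low_copy is False" message in the log.
    return "skipped"


print(classify_consensus(25, False, 0))
print(classify_consensus(6, True, 3))
print(classify_consensus(6, True, 2))
```

Because the outcome depends only on properties of the hits, there is no user-facing switch to flip; only changing the genome or the input consensus changes the result.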

As for your second question, you can refer to issue #38. I will update TEtrimmer to make it more compatible for pangenomes.

For the third question, could you send me your summary file and the PDF report files that correspond to the examples you showed? Then I can take another look.

Yours sincerely, Jiangzhao

lanasushko commented 3 months ago

Hi Jiangzhao,

Thanks for the clarifications!

I am sending you the PDFs and summary files for the 4 examples. Three of the runs were performed on a reference genome (refrun tag) and one was produced on another genome that had a higher number of hits for this particular TE (angeb2run tag). You can see both summary files below.

angeb2run-CONS_200_17833_5_1_6496_175#LINE__L1.pdf
refrun-CONS_200_17833_5_1_4624_446#DNA__MULE-MuDR.pdf
refrun-CONS_200_17833_5_1_4737_416#LTR__Copia.pdf
refrun-CONS_200_17833_5_1_4807_390#LTR__Copia.pdf
summary_angeb2run.txt
summary_refrun.txt

Thanks again and best wishes,

Lana

qjiangzhao commented 3 months ago

Dear Lana,

Thanks for supplying the files. Based on the report files and plots, many elements are skipped. Currently, TEtrimmer does not fully support pangenome analyses, especially when the TEs were identified with PANTERA based on polymorphisms. I will update TEtrimmer to make it more suitable for pangenomes.

Yours sincerely, Jiangzhao

qjiangzhao / TEtrimmer

Manually curating low copy number TEs #41