qjiangzhao / TEtrimmer

TEtrimmer: a novel tool to automate manual curation of transposable elements
GNU General Public License v3.0

Some issues with the "cons" step #46

Open abcyulongwang opened 1 week ago

abcyulongwang commented 1 week ago

Dear Jiangzhao,

I encountered this problem during manual curation.

This is the TE-Aid plot of a manually edited multiple sequence alignment of a LINE transposon. It looks OK. But when I run the "Cons" step of TEtrimmer, the plot becomes like this, without even one full-length sequence.

image

image

There are many similar examples. For some transposon consensus sequences, TE-Aid cannot even run successfully and displays "BLAST hit number is 0 for this sequence." I guess this means we can abandon these TE sequences completely because they are of low quality. I also have an idea: after running TEtrimmer, I want to compare the quality of the TE library with and without manual editing. Do you have any recommended methods for this quality comparison? Manual curation is too time-consuming, and it will be a nightmare for me.

Thank you for your previous reply. I wish you happiness every day.

Yulong

qjiangzhao commented 1 week ago

Hi Yulong,

Thanks for your question. But I am a little bit confused. What is the difference between those provided TE-Aid plots?

Could you explain your question again?

Yours sincerely Jiangzhao

abcyulongwang commented 1 week ago

The two TE-Aid plots above are for the same TE. The difference is that the lower one is for the "TE.cons.fa" sequence, that is, after generating the consensus sequence. The lower plot shows that this cons.fa has no full-length hits, suggesting that the transposon structure is likely incomplete. Because some bases are poorly conserved in the multiple sequence alignment, they were replaced by "N" when the consensus sequence was generated.

My question may seem foolish, but the reality is that only a small number of TEs have reached the Perfect and Good levels. Most of the TE clusters do not show good conservation, and many of the cons.fa files contain a lot of Ns. Could this affect my research on transposable element polymorphisms?
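One rough way to put a number on "a lot of Ns" is to compute the fraction of N characters in each consensus sequence. The sketch below is only an illustration: `read_fasta` and `n_fraction` are hypothetical helper names, not part of TEtrimmer, and the two records are made-up example data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq):
    """Return the share of 'N' characters in a sequence (0.0 if empty)."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0


# Made-up example records; not real TEtrimmer output.
example = """>TE_1
ACGTNNNNAC
>TE_2
ACGTACGTAC
""".splitlines()

fractions = {name: n_fraction(seq) for name, seq in read_fasta(example)}
print(fractions)
```

For a real file, the same functions could be fed an open file handle (for example one opened on a cons.fa file) instead of the `example` lines. What N fraction counts as too high is a judgment call that this sketch does not answer.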

image

image

Should I just delete these bad results, even though that will reduce the number of predicted transposons? Best, Yulong

qjiangzhao commented 1 week ago

Dear Yulong,

The "TEAid" button in TEtrimmerGUI can be applied to either a multiple sequence alignment (MSA) or a TE consensus sequence.

If the input is an MSA, a consensus sequence is generated with a threshold of 0.5. This corresponds to your plot:

image

When you click the "Cons" button, a consensus sequence is generated with the default threshold of 0.8. You can also apply the "TEAid" button to this consensus sequence; the corresponding plot is:

image

Because the thresholds used for consensus sequence generation are different, the "TEAid" plots also differ, especially for poorly conserved TEs (like your LINE element).
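To illustrate why the threshold matters, here is a minimal sketch of a majority-rule consensus in Python. It is not TEtrimmer's actual implementation: the function name `threshold_consensus`, the handling of gaps, and the toy alignment are all assumptions made for the example. A column keeps its most common base only if that base reaches the threshold; otherwise it becomes "N".

```python
from collections import Counter


def threshold_consensus(msa, threshold, gap="-", ambiguous="N"):
    """Majority-rule consensus for equal-length aligned sequences.

    For each column, the most common non-gap base is used if its share of
    the non-gap characters is at least `threshold`; otherwise `ambiguous`
    is written. Columns that contain only gaps are skipped.
    """
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*msa):
        bases = [b.upper() for b in column if b != gap]
        if not bases:
            continue  # all-gap column contributes nothing
        base, count = Counter(bases).most_common(1)[0]
        consensus.append(base if count / len(bases) >= threshold else ambiguous)
    return "".join(consensus)


# Toy alignment (made up): columns 3 to 5 are only 60% conserved.
msa = [
    "ACGTAC",
    "ACGTTC",
    "ACCTGC",
    "ACCAGC",
    "ACGAGC",
]

print(threshold_consensus(msa, 0.5))  # ACGTGC
print(threshold_consensus(msa, 0.8))  # ACNNNC
```

In this toy case the same alignment gives a fully resolved consensus at 0.5 but three Ns at 0.8, which is the kind of difference that can change the number of full-length BLAST hits in the plot.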

You can modify the "Cons" threshold by:

Screenshot 2024-10-16 at 10 17 01

Based on your AliView screenshot, I would not discard this LINE element. You can choose to lower the "Cons" threshold and save your MSA as an HMM model.

Yours sincerely Jiangzhao