oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

TE comparisons in different species #376

Closed YuboWang1994 closed 10 months ago

YuboWang1994 commented 10 months ago

Hi,

I truly appreciate your development of the EDTA pipeline, which has been immensely helpful to me. However, I have a small issue that I would like to trouble you with.

I am currently studying four plant genomes from the same family. I annotated the repetitive sequences of each genome with the EDTA pipeline and obtained a non-redundant TE library for each one. My next step is to compare the repeat content of the sex-determining regions across these four genomes, for example copy numbers and sequence composition. However, because each genome was annotated separately, the four libraries may reuse the same TE IDs (such as TE_00000000), which makes direct comparison difficult.

Since these four species belong to the same family and show high sequence and gene collinearity, I have designed the following approach:

  1. Classify each non-redundant TE library based on sequence types, such as TIR, helitron, LTR-gypsy, LTR-copia, LTR-unknown, etc.
  2. Merge TE sequences of the same type from the four genomes, remove redundancy using CD-HIT, and rename sequence IDs.
  3. Perform subsequent analyses using the new IDs, such as comparing copy number and compositional differences of the same TE across different species.
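
A minimal sketch of the bookkeeping behind steps 1 and 2, in plain Python. It assumes library headers follow the `ID#class/superfamily` pattern (for example `TE_00000000#LTR/Gypsy`); the species tags `spA` and `spB`, the function names, and the demo sequences are hypothetical. It only groups sequences by class and makes the IDs unique by prefixing the species tag; the redundancy removal itself (for example with CD-HIT) would be run separately on each per-class set and is not shown.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_by_class(libraries):
    """Group TE sequences by class and prefix each ID with its species tag.

    libraries: dict mapping a species tag to the FASTA text of its TE library.
    Returns a dict mapping TE class to a list of (new_id, sequence) tuples.
    """
    merged = defaultdict(list)
    for species, text in libraries.items():
        for header, seq in parse_fasta(text):
            name = header.split()[0]               # drop any description after the ID
            te_id, _, te_class = name.partition("#")
            te_class = te_class or "unknown"       # headers without a class label
            merged[te_class].append((f"{species}_{te_id}#{te_class}", seq))
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical toy libraries: both species reuse the ID TE_00000000.
    demo = {
        "spA": ">TE_00000000#LTR/Gypsy\nACGTACGT\n>TE_00000001#DNA/Helitron\nTTGGCCAA\n",
        "spB": ">TE_00000000#LTR/Gypsy\nACGTTCGT\n",
    }
    for te_class, records in sorted(merge_by_class(demo).items()):
        print(te_class, [record_id for record_id, _ in records])
```

After the prefixing, the two `TE_00000000` entries no longer collide, so each per-class set can be written out and clustered without ambiguity about which genome a sequence came from.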

Does this sequence similarity-based approach to merging non-redundant TE libraries make scientific sense?

I appreciate your help.

oushujun commented 10 months ago

Hello,

Please check out panEDTA for this purpose.

Thanks! Shujun

YuboWang1994 commented 10 months ago

Hi,

Thanks for your help. It seems to work.

Best, Yubo Wang