non-redundant LTR library across multiple species

reubwn commented 4 years ago

Hi Shujun,

I would like to generate a non-redundant LTR library across multiple input genomes, some of which belong to the same species while others are quite divergent (e.g., from different genera). TL;DR: what is the best practice for doing this with your tools?

For example, I have run LTR_retriever successfully on each genome individually, so I already have a combined LTR library that presumably contains a lot of redundancy among the within-species samples. I could cluster it using CD-HIT or similar, but I wondered whether there is a more refined approach using LTR_retriever directly. One idea would be to simply combine all scaffolds and run the whole pipeline from scratch, but this would take an intractably long time. Another idea is to cat the *rawLTR.scn files from the individual runs and give the result, along with the combined FASTA file, to LTR_retriever, i.e. something like:

./LTR_retriever -genome combined_genomes.fa -inharvest combined_rawLTR.scn

But I don't think that will work, because the sequence indexing in the *.scn files breaks when the scaffolds are combined.

Do you have any thoughts as to how to achieve this? Perhaps there are some submodules of LTR_retriever that could be run on these files to generate the cross-species library?

Many thanks for your help!

oushujun commented 4 years ago

Hi @reubwn,

As you mentioned, one way is to use CD-HIT for clustering. I have been using the 80-90-100 rule for LTR sequences: sequences are clustered at a minimum of 80% identity and 90% coverage, with a minimum length of 100 bp. You may also use this utility script in the EDTA package to cluster the aggregated libraries: https://github.com/oushujun/EDTA/blob/master/util/cleanup_nested.pl

Best, Shujun
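
Editor's sketch (not from the thread): one way the CD-HIT route with the 80-90-100 rule might look in practice. It merges the per-genome libraries into one FASTA file, prefixes each ID with a genome label so names stay unique, drops exact duplicates and sequences shorter than 100 bp, and then builds a cd-hit-est command. The function names, the label prefix scheme, and the file names are hypothetical. Mapping "90% coverage" to `-aS 0.9` (coverage of the shorter sequence) is an assumption, as are the remaining options (`-n 5`, `-G 0`, `-d 0`, `-T`, `-M 0`); check them against the CD-HIT documentation before use.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(libs, out_path, min_len=100):
    """Merge per-genome libraries into one FASTA file.

    libs maps a genome label (hypothetical naming) to a library path.
    Sequences shorter than min_len and exact duplicates are skipped.
    Returns the number of records written.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for label, path in libs.items():
            for header, seq in read_fasta(path):
                key = seq.upper()
                if len(seq) < min_len or key in seen:
                    continue
                seen.add(key)
                # Prefix the ID with the genome label to keep it unique.
                out.write(f">{label}_{header}\n{seq}\n")
                written += 1
    return written


def cdhit_command(merged, clustered, identity=0.8, coverage=0.9, threads=4):
    """Build a cd-hit-est argument list (assumed reading of the rule)."""
    return [
        "cd-hit-est",
        "-i", merged,          # input FASTA
        "-o", clustered,       # output representatives
        "-c", str(identity),   # minimum sequence identity
        "-aS", str(coverage),  # assumed: coverage of the shorter sequence
        "-n", "5",             # word size suggested for ~0.80 identity
        "-G", "0",             # local alignment, so -aS is what constrains it
        "-d", "0",             # keep full sequence names in the .clstr file
        "-T", str(threads),
        "-M", "0",             # no memory limit
    ]
```

The argument list can be passed to `subprocess.run(cmd, check=True)` once cd-hit-est is on the PATH. The length filter is applied in Python rather than through a CD-HIT option so that the 100 bp cutoff is explicit.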

reubwn commented 4 years ago

Hi Shujun, thank you for your suggestions!

oushujun commented 4 years ago

I am closing this thread. Please let me know if you have further questions.

oushujun / LTR_retriever

non-redundant LTR library across multiple species #63