reubwn closed 4 years ago
Hi @reubwn,
As you mentioned, one way is to use CD-HIT for clustering. I have been using the 80-90-100 rule for clustering LTR sequences: a minimum of 80% identity, 90% coverage, and 100 bp in length. You may also use this utility script in the EDTA package to cluster the aggregated libraries: https://github.com/oushujun/EDTA/blob/master/util/cleanup_nested.pl
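For illustration, here is a minimal Python sketch of how the 80-90-100 rule might be turned into a length filter plus a `cd-hit-est` argument list. The helper names (`filter_min_length`, `cdhit_est_command`) and the file names are hypothetical. Reading "90% coverage" as coverage of the shorter sequence (`-aS`), and picking word size `-n 5` for an 80% threshold, are assumptions rather than settings confirmed in this thread.

```python
from typing import Dict, List


def filter_min_length(seqs: Dict[str, str], min_len: int = 100) -> Dict[str, str]:
    """Keep only sequences that are at least min_len bp long (the '100' in 80-90-100)."""
    return {name: seq for name, seq in seqs.items() if len(seq) >= min_len}


def cdhit_est_command(fasta_in: str, fasta_out: str,
                      identity: float = 0.8, coverage: float = 0.9) -> List[str]:
    """Build (but do not run) a cd-hit-est argument list for the identity/coverage part."""
    return [
        "cd-hit-est",
        "-i", fasta_in,        # combined LTR library (hypothetical file name)
        "-o", fasta_out,       # clustered, non-redundant output
        "-c", str(identity),   # sequence identity threshold
        "-aS", str(coverage),  # assumption: coverage is measured on the shorter sequence
        "-n", "5",             # assumption: word size suited to a ~0.8 threshold
        "-d", "0",             # keep full sequence names in the .clstr file
    ]


if __name__ == "__main__":
    print(" ".join(cdhit_est_command("combined.LTRlib.fa", "combined.LTRlib.nr.fa")))
```

The argument list could be passed to `subprocess.run` once the thresholds have been checked against the CD-HIT documentation for your version.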
Best, Shujun
Hi Shujun, Thank you for your suggestions!
I am closing this thread. Please let me know if you have further questions.
Hi Shujun,
I would like to generate a non-redundant LTR library across multiple input genomes, some of which are within the same species and others of which are quite divergent (e.g. across genera). TL;DR: what is the best practice for doing this with your tools?
For example, I have run LTR_retriever successfully on each genome individually, and so already have a combined LTR library that presumably contains quite a lot of redundancy among the within-species samples. I could cluster using CD-HIT or similar, but I wondered if there was a more refined approach using the LTR_retriever program directly. One idea would be to simply combine all scaffolds together and run the whole pipeline from scratch, but this would take an intractably long time. Another idea was to cat the *rawLTR.scn files from the individual runs and give this along with the combined fasta file to LTR_retriever, i.e. something like:
But I don't think that will work as the sequence indexing in the *scn file is broken when the scaffolds are combined.
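To make the indexing problem concrete, here is a rough Python sketch of what renumbering might look like. It assumes the sequence index is the 11th whitespace-separated column (as in LTRharvest's tabular output) and that it counts sequences from 0 in FASTA order; both the function names and that column layout are assumptions. I have not tested whether LTR_retriever accepts a file merged this way.

```python
from typing import List, Tuple


def shift_seq_index(scn_lines: List[str], offset: int, index_col: int = 10) -> List[str]:
    """Add offset to the assumed sequence-index column of each candidate line."""
    shifted = []
    for line in scn_lines:
        # Pass comment/header lines and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            shifted.append(line)
            continue
        fields = line.split()
        fields[index_col] = str(int(fields[index_col]) + offset)
        # Note: columns are re-joined with single spaces.
        shifted.append(" ".join(fields))
    return shifted


def merge_scn(per_genome: List[Tuple[List[str], int]]) -> List[str]:
    """Merge per-genome .scn lines; each tuple is (lines, number of sequences in that FASTA)."""
    merged: List[str] = []
    offset = 0
    for lines, n_sequences in per_genome:
        merged.extend(shift_seq_index(lines, offset))
        offset += n_sequences  # the next genome's indices start after this one's sequences
    return merged
```

The FASTA files would need to be concatenated in the same order as the tuples passed to `merge_scn`, and scaffold names would need to be unique across genomes.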
Do you have any thoughts as to how to achieve this? Perhaps there are some submodules of LTR_retriever that could be run on these files to generate the cross-species library?
Many thanks for your help!