nebiolabs / domainator

A flexible and modular software suite for domain-based gene neighborhood and protein search, extraction, and clustering.

faster clustering of DNA sequences #11

Open seanrjohnson opened 3 months ago

seanrjohnson commented 3 months ago

deduplicate_genbank.py is very slow on nucleotide sequences because it relies on cd-hit for clustering. There are probably faster ways to cluster nucleotide sequences, and we should look into integrating one of them.
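
As one possible starting point, here is a rough sketch of a cheap pre-pass that collapses exact duplicates (including reverse complements) by hashing, so that a slower clustering step would only see unique sequences. This is an assumption about what might help, not how deduplicate_genbank.py currently works; the names `canonical_key` and `collapse_exact_duplicates` are hypothetical and not part of domainator, and this does not replace similarity-based clustering.

```python
import hashlib

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_key(seq: str) -> str:
    """Return a hash shared by a DNA sequence and its reverse complement."""
    fwd = seq.upper()
    rev = fwd.translate(_COMPLEMENT)[::-1]
    # Use the lexicographically smaller strand as the canonical form.
    return hashlib.sha1(min(fwd, rev).encode("ascii")).hexdigest()


def collapse_exact_duplicates(records):
    """Group (name, sequence) pairs that are identical on either strand.

    Returns a dict mapping the first name seen in each group (the
    representative) to the list of all names in that group.
    """
    groups = {}
    for name, seq in records:
        groups.setdefault(canonical_key(seq), []).append(name)
    return {names[0]: names for names in groups.values()}


# Hypothetical toy input: "b" differs from "a" only in case,
# "c" is the reverse complement of "a", "d" is a distinct sequence.
records = [
    ("a", "ATGCCGTA"),
    ("b", "atgccgta"),
    ("c", "TACGGCAT"),
    ("d", "ATGCCGTT"),
]
clusters = collapse_exact_duplicates(records)
```

Hashing is linear in total sequence length, so it should stay fast even on large inputs, but it only catches identical sequences; near-duplicates would still need a real clustering tool.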