rcedgar / muscle

Multiple sequence and structure alignment with top benchmark scores scalable to thousands of sequences. Generates replicate alignments, enabling assessment of downstream analyses such as trees and predicted structures.
https://drive5.com/muscle
GNU General Public License v3.0

How can I download v5 muscle? #78

Closed brilliant2643 closed 2 months ago

brilliant2643 commented 2 months ago

Hi! I'm using MUSCLE to align my sequences, but I got this error:

`MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

genomes 676595 seqs, lengths min 200, max 3385249, avg 558
segmentation fault (core dumped)`

I searched the Issues for this error and found that it may be caused by the sequences being too long (I'm not quite sure about this). One of the replies said that version 5 might solve the problem, so I downloaded the latest version from Releases, but the binary I got is version 3.8. How can I download MUSCLE v5?

Thanks, Sushi

rcedgar commented 2 months ago

You can download binary files by going to the GitHub repository home page and clicking "Releases". However, this set is far too large for any multiple sequence alignment software: the sequences are too long and there are far too many of them. From the sequence lengths and the name "genomes", I'm guessing these are bacterial genomes. Even if MUSCLE could align a set like this, bacterial genomes are not globally alignable unless they are very closely related strains; genomes in the same species often have quite different gene content. What is the goal of making the alignment? Maybe I can suggest a different approach.
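To check the size of a set before trying to align it, the count and length summary that MUSCLE printed above can be reproduced without running an aligner. Here is a minimal Python sketch, assuming a plain uncompressed FASTA file; the function name `fasta_length_stats` is made up for this example and is not part of MUSCLE.

```python
def fasta_length_stats(path):
    """Return (count, min_len, max_len, mean_len) for the sequences in a FASTA file."""
    lengths = []
    current = None  # length of the record being read; None until the first header line
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                # Sequence lines may be wrapped, so add up every line of the record.
                current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        return 0, 0, 0, 0.0
    return len(lengths), min(lengths), max(lengths), sum(lengths) / len(lengths)
```

Hundreds of thousands of sequences combined with a maximum length in the millions, as in the log above, is the pattern that indicates a set no multiple aligner is likely to handle.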

brilliant2643 commented 2 months ago

Hi! Thanks for your reply! I'm a beginner at bioinformatics, so I don't know whether it's correct to use MUSCLE to align the genomes to prepare the input file for FastTree. By the way, the file named 'genome' contains genomes collected from several samples, so it is very big. Do you have any suggestions on how to get a phylogenetic tree from the genome file?

Thanks again for your suggestion! Sushi

rcedgar commented 2 months ago

You can make a tree from the 16S rRNA gene, which is found in all bacterial genomes: https://en.wikipedia.org/wiki/16S_ribosomal_RNA. To find the 16S gene, use the search_16s command in usearch: https://rcedgar.github.io/usearch12_documentation/cmd_search_16s.html. There will be many duplicates and very closely related 16S sequences, because one genome can have several identical or near-identical copies of 16S and you may have closely related strains in your samples. To reduce the number of genes for alignment and tree-building, you could cluster at 99% identity using the cluster_fast command in usearch: https://rcedgar.github.io/usearch12_documentation/cmd_cluster_fast.html
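The steps above, followed by the alignment and FastTree steps mentioned earlier in the thread, can be sketched as a small Python helper that only builds the command lines and runs nothing. Everything in it is an assumption to check against the linked usearch pages and the MUSCLE and FastTree documentation before use: the file names, the `gg97.bitvec` reference file, the helper name `build_16s_tree_commands`, and the exact option names (`-bitvec`, `-fastaout`, `-centroids`, `-super5`, `-gtr -nt`), which are written from memory of those tools' documentation.

```python
import shlex


def build_16s_tree_commands(genomes_fa, bitvec, prefix="run"):
    """Return argv lists for a 16S-based tree workflow. Nothing is executed."""
    s16 = f"{prefix}_16s.fa"             # 16S sequences extracted from the genomes
    centroids = f"{prefix}_16s_c99.fa"   # one representative per 99% identity cluster
    aln = f"{prefix}_16s.afa"            # aligned FASTA for the tree builder
    return [
        # 1. Find 16S genes in the genome contigs (option names are assumptions).
        ["usearch", "-search_16s", genomes_fa, "-bitvec", bitvec, "-fastaout", s16],
        # 2. Cluster at 99% identity to collapse duplicate and near-identical copies.
        ["usearch", "-cluster_fast", s16, "-id", "0.99", "-centroids", centroids],
        # 3. Align the representatives with MUSCLE v5 (-super5 is meant for larger sets).
        ["muscle", "-super5", centroids, "-output", aln],
        # 4. Build a nucleotide tree; FastTree writes Newick to stdout, so redirect it.
        ["FastTree", "-gtr", "-nt", aln],
    ]


if __name__ == "__main__":
    for argv in build_16s_tree_commands("genomes.fa", "gg97.bitvec"):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review each step and run them one at a time, which helps when any single step may need different options for a particular data set.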

rcedgar commented 2 months ago

Closing because this is not really a MUSCLE issue. @brilliant2643, you are welcome to email me with further questions about how to do this.

brilliant2643 commented 1 month ago

Thanks for your suggestion! I'll try it!