sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

Different outputs between mmseqs2 API and colabfold_search #197

Open YFeriel opened 2 years ago

YFeriel commented 2 years ago

Hello! I want to compute MSAs for a set of proteins in order to do structure prediction. Until now I used the script that calls the MMseqs2 API that ColabFold uses for the MSA calculation. I now want to run MMseqs2 directly so that I can do the calculation on a cluster. To do this, I used the script colabfold_search.sh without a precomputed index (https://gist.github.com/milot-mirdita/67509c248746c4c774128fc84ab91b6f), with the two databases uniref30_2103 and colabfold_envdb_202108. I set USE_ENV to 1, USE_TEMPLATE to 0 and FILTER to 1.

The problem is that the resulting MSA is very different from the one I get from the API. For example, for a protein sequence of length 679, the API gives an MSA of 20446 sequences, while colabfold_search.sh gives an MSA of 20865 sequences, and only 1150 sequences are common to the two. Is there any way to get the same output as the API using the colabfold_search.sh script? The input and the outputs are here: https://drive.google.com/drive/folders/1ZcAHKRzxT4hK-Bjb8ZKozDfaO_cTptwO?usp=sharing

YFeriel commented 2 years ago

Hello, I finally found the solution. As the ColabFold script does, I had to run the parse.parse_a3m function from AlphaFold on my outputs so that they are comparable to the API outputs.
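
For anyone comparing two A3M files in the same way, here is a minimal sketch of the normalization step. It is not the AlphaFold code. It is a small stand-in that assumes, as I understand the A3M format, that lowercase letters mark insertions and are dropped when the alignment is parsed. The names `read_a3m` and `shared_sequences` and the toy inputs are hypothetical and only for illustration.

```python
import string
from typing import List, Tuple

# Translation table that deletes every lowercase letter
# (assumption: lowercase letters are A3M insertion columns).
_DROP_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def read_a3m(text: str) -> Tuple[List[str], List[str]]:
    """Return (headers, aligned_sequences) parsed from A3M-formatted text."""
    headers: List[str] = []
    sequences: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and leading comment lines
        if line.startswith(">"):
            headers.append(line[1:])
            sequences.append("")
        elif headers:
            sequences[-1] += line  # sequences may span several lines
    # Remove insertions so every sequence has the query's aligned length.
    aligned = [seq.translate(_DROP_LOWERCASE) for seq in sequences]
    return headers, aligned


def shared_sequences(a3m_a: str, a3m_b: str) -> int:
    """Count distinct aligned sequences present in both A3M texts."""
    _, seqs_a = read_a3m(a3m_a)
    _, seqs_b = read_a3m(a3m_b)
    return len(set(seqs_a) & set(seqs_b))


# Toy inputs (hypothetical, not the real outputs from this issue).
api_like = ">query\nMKV\n>hit1\nM-V\n>hit2\nMKI\n"
local_like = "#3\t1\n>query\nMKV\n>hit1\nMaa-V\n>hit3\nMRV\n"

print(shared_sequences(api_like, local_like))
```

Comparing the raw strings would treat `M-V` and `Maa-V` as different sequences, whereas after normalization they count as the same aligned row.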