Hello!
I want to compute MSAs for a set of proteins in order to do structure prediction. Until now I used the script that calls the mmseqs2 API that colabfold uses for the MSA calculation.
Now I would like to use mmseqs2 directly so that I can run the calculation on a cluster.
To do this, I used the script colabfold_search.sh without a precomputed index (https://gist.github.com/milot-mirdita/67509c248746c4c774128fc84ab91b6f), with the two databases uniref30_2103 and colabfold_envdb_202108. I set USE_ENV to 1, USE_TEMPLATE to 0 and FILTER to 1.
The problem is that the resulting MSA is very different from the MSA I get from the API.
For example, take a protein sequence of length 679.
With the API I get an MSA of 20446 sequences, while with colabfold_search.sh I get an MSA of 20865 sequences, and only 1150 sequences are common to the two methods.
Is there any way to get the same output as the API using the colabfold_search.sh script?
The input and the outputs are here:
https://drive.google.com/drive/folders/1ZcAHKRzxT4hK-Bjb8ZKozDfaO_cTptwO?usp=sharing
The MSA from the mmseqs2 API is stored in msa_api.pickle.
The MSA from mmseqs2 run on the cluster is an .a3m file; I converted it myself into msa_mmseqs2.pickle to make the comparison.
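For reference, a comparison like the one described above can be sketched as a set intersection over the sequences of each MSA. This is only a sketch: it assumes each MSA has already been loaded as a plain list of strings (the actual layout of the pickle files is not described in this post), and `msa_overlap` is a hypothetical helper name.

```python
def msa_overlap(msa_a, msa_b):
    """Count unique sequences in each MSA and how many are shared.

    Assumption: msa_a and msa_b are lists of sequence strings.
    The comparison is an exact string match.
    """
    set_a, set_b = set(msa_a), set(msa_b)
    return len(set_a), len(set_b), len(set_a & set_b)


# Toy example with made-up sequences (not the data from this thread).
n_a, n_b, n_shared = msa_overlap(["AC-D", "ACGD", "ACGD"], ["AC-D", "ACgD"])
print(n_a, n_b, n_shared)
```

Because the match is exact, two rows that differ only in formatting (for example lowercase versus uppercase characters) are counted as different sequences.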
Hello,
I finally found the solution. As the colabfold script does, I had to run my outputs through the parse.parse_a3m function of alphafold so that they are comparable to the API outputs.
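The thread does not say exactly what the parser changes. In the A3M format, lowercase letters mark insertions relative to the query, and a parser typically removes them so that every row has the same width as the query. The following is a minimal, self-contained sketch of that normalization, not AlphaFold's actual code; `read_a3m_aligned` is a hypothetical name, and skipping lines that start with `#` is an assumption about the file header.

```python
import string

# Translation table that deletes every lowercase letter (A3M insertion columns).
_DROP_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def read_a3m_aligned(a3m_text):
    """Parse A3M text into (descriptions, aligned_sequences)."""
    descriptions, sequences = [], []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' header lines carry no sequence
        if line.startswith(">"):
            descriptions.append(line[1:])
            sequences.append("")
        elif sequences:
            sequences[-1] += line  # a sequence may span several lines
    # Removing insertions leaves rows aligned to the query columns.
    aligned = [seq.translate(_DROP_LOWERCASE) for seq in sequences]
    return descriptions, aligned


# Toy example with made-up content (not the data from this thread).
names, rows = read_a3m_aligned("#4\t1\n>query\nACGD\n>hit1 example\nAC\nxyG-\n")
print(names, rows)
```

Applying the same normalization to both MSAs before comparing them avoids counting rows as different only because one side still contains insertion characters.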