sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

Generating MSAs for chimeric proteins #564

Status: Open. wcorcoran opened this issue 5 months ago

wcorcoran commented 5 months ago

I'm trying to predict multimeric complexes in which one chain is a chimeric (fusion) protein with components from different organisms. About 85% of this chain comes from one organism and the remaining ~15% from another. The sequence coverage plot shows virtually no sequences aligning to the 15% segment (and the predicted structure is very low confidence). My guess is that this segment makes up such a small part of the overall protein that sequences aligning only to it score very low in the search and are dropped. Is there any way to 'lock' segments of a single protein (especially a chimeric one) so that each segment keeps a diverse set of sequences when the chain has components from different lineages?

Alternatively, if I need to create a custom MSA, is there a way to extract from ColabFold an .a3m file containing only the sequences that were actually used as input (those shown in the sequence coverage plot)? All of the .a3m files I can access seem to contain ~20k sequences, whereas the coverage plot shows only the ~3k that passed the final sequence diversity filtering.

Many thanks in advance!
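
For anyone who wants to check per-residue coverage directly from an .a3m file rather than from the plot, here is a minimal sketch in plain Python. It assumes standard A3M conventions (first entry is the query, lowercase letters are insertions relative to the query, `-` is a gap) and skips lines starting with `#`. The function names `read_a3m`, `match_columns` and `coverage` are made up for this example and are not part of ColabFold. It does not reproduce ColabFold's filtering, so the counts may not match the plot exactly.

```python
from typing import List, Tuple


def read_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse A3M text into (header, sequence) pairs; the first entry is the query."""
    entries: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' lines
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def match_columns(seq: str) -> str:
    """Drop lowercase insertion states so the row lines up with the query."""
    return "".join(c for c in seq if not c.islower())


def coverage(entries: List[Tuple[str, str]]) -> List[int]:
    """Count, for each query position, how many rows have a residue there."""
    if not entries:
        return []
    query_len = len(match_columns(entries[0][1]))
    counts = [0] * query_len
    for _, seq in entries:
        row = match_columns(seq)
        if len(row) != query_len:
            continue  # skip rows that do not line up with the query
        for i, c in enumerate(row):
            if c != "-":
                counts[i] += 1
    return counts


demo = ">query\nACDEFG\n>hit1\nACD---\n>hit2\nA-DeEFG\n"
print(coverage(read_a3m(demo)))  # [3, 2, 3, 2, 2, 2]
```

A run of low counts over one stretch of the query, as in the chimeric segment described above, would show up as a dip in the returned list.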

milot-mirdita commented 5 months ago

Could you post the MSA plots please?

The pairing procedure works by matching hits that share the same species taxon identifier, so the paired part of the plot is expected to be empty. The unpaired part should still contain a lot of hits.
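
To illustrate why taxon-based pairing comes up empty for chains from different lineages, here is a toy sketch of the idea. It is not ColabFold's actual implementation: the `pair_by_taxon` function, the `(taxon_id, aligned_sequence)` tuples and the rule of keeping the first hit per taxon are all assumptions made for this example.

```python
from typing import List, Tuple

Hit = Tuple[str, str]  # (taxon_id, aligned_sequence), a hypothetical layout


def pair_by_taxon(hits_a: List[Hit], hits_b: List[Hit]) -> List[Tuple[str, str, str]]:
    """Pair the first hit of each taxon in chain A with the first hit of the
    same taxon in chain B. Taxa found in only one chain are left unpaired."""
    first_b = {}
    for taxon, seq in hits_b:
        first_b.setdefault(taxon, seq)  # keep only the first hit per taxon

    paired = []
    seen = set()
    for taxon, seq in hits_a:
        if taxon in first_b and taxon not in seen:
            paired.append((taxon, seq, first_b[taxon]))
            seen.add(taxon)
    return paired


# Chains whose homologs come from the same species pair up.
hits_a = [("9606", "AAA"), ("10090", "AAC"), ("562", "A-A")]
hits_b = [("9606", "GGG"), ("562", "G-G"), ("562", "GGA")]
print(pair_by_taxon(hits_a, hits_b))
# [('9606', 'AAA', 'GGG'), ('562', 'A-A', 'G-G')]

# Chains whose homologs share no taxon give an empty paired set.
print(pair_by_taxon([("9606", "AAA")], [("562", "GGG")]))  # []
```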

wcorcoran commented 5 months ago

Thanks for the quick reply! I've attached an example coverage plot. Seq 1 and Seq 2 each contain a component that makes up ~13% of the chain and is derived from a different lineage than the other 87%. Coverage drops sharply over that component, even in the unpaired alignment. As a test, I removed a chunk of the N-terminal sequence of Seq 1 and Seq 2 so that the 13% part constitutes ~30-40% of the chain. Coverage for that component then improves, but at the expense of predicting the whole chain. I also tried using only unpaired sequences, which gives better coverage for the 13% part, but the prediction quality suffers a lot and the resulting structure looks quite unreasonable.

[Image: paired_unpaired]
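
One possible way to build a custom MSA for a chimeric chain is to align each segment separately and then pad the rows with gaps back to full-chain coordinates. The sketch below assumes you already have per-segment alignments in A3M convention, each produced by searching with that segment alone as the query. The names `pad_segment_rows` and `build_chimeric_a3m` and the `(offset, rows)` input layout are hypothetical. How the resulting file is supplied to ColabFold, and whether it improves the prediction, is not covered here.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (header, A3M-style aligned sequence)


def pad_segment_rows(rows: List[Row], offset: int, full_len: int) -> List[Row]:
    """Embed rows aligned to one segment into full-chain coordinates.

    offset is the 0-based start of the segment within the full chain.
    Lowercase insertion characters do not occupy a query column.
    """
    padded = []
    for header, seq in rows:
        seg_len = sum(1 for c in seq if not c.islower())
        right = full_len - offset - seg_len
        if offset < 0 or right < 0:
            raise ValueError(f"row {header!r} does not fit in the full chain")
        padded.append((header, "-" * offset + seq + "-" * right))
    return padded


def build_chimeric_a3m(full_query: str, segments: List[Tuple[int, List[Row]]]) -> str:
    """Write the full query followed by gap-padded rows from every segment.

    segments is a list of (offset, rows); rows should exclude the segment's
    own query entry so the full-length query appears only once.
    """
    lines = [">query", full_query]
    for offset, rows in segments:
        for header, seq in pad_segment_rows(rows, offset, len(full_query)):
            lines.append(">" + header)
            lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy chain of 10 residues: segment 1 covers positions 0-5, segment 2 covers 6-9.
full_query = "MKTAYIAKQR"
segment_1 = (0, [("a1", "MKTaAYI")])  # lowercase 'a' is an insertion
segment_2 = (6, [("b1", "AK-R")])
print(build_chimeric_a3m(full_query, [segment_1, segment_2]), end="")
```

Each hit then covers only its own segment and is gapped elsewhere, so hits to the short segment no longer have to compete with hits to the rest of the chain during the search.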