sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

training = True causes a GPU memory error, asks for 53 GB on normal sized proteins #352

Open yoann-glanum opened 1 year ago

yoann-glanum commented 1 year ago

Hello,

Sorry for bothering again,

I've been looking into manipulating the parameters of the batch run function in the Colab notebook (this one: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/batch/AlphaFold2_batch.ipynb), and got a crash.

Trying to isolate which of my changed parameters is responsible, it seems that just setting training = True is enough. Here are the parameters, the crash message, and an example input sequence:

---- parameters

all default params except:

  • input and output dirs, but it finds the right files and seems to read them fine (it outputs the right pdb names)
  • stop at score 90
  • training = True

---- crash message

2023-01-11 09:30:23,449 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_1JEV.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,452 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_5KZT.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,454 More than one sequence in /content/drive/MyDrive/ColabFold_files/batch/input_fasta/rcsb_pdb_7JLS.fasta, ignoring all but the first sequence
2023-01-11 09:30:23,465 Found 5 citations for tools or databases
2023-01-11 09:30:27,958 Query 1/3: rcsb_pdb_1JEV (length 517)
2023-01-11 09:30:29,785 Running model_3
2023-01-11 09:30:52,301 Could not predict rcsb_pdb_1JEV. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:30:52,329 Query 2/3: rcsb_pdb_5KZT (length 536)
2023-01-11 09:30:53,880 Running model_3
2023-01-11 09:31:15,591 Could not predict rcsb_pdb_5KZT. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:31:15,622 Query 3/3: rcsb_pdb_7JLS (length 553)
2023-01-11 09:31:16,237 Running model_3
2023-01-11 09:31:32,319 Could not predict rcsb_pdb_7JLS. Not Enough GPU memory? RESOURCE_EXHAUSTED: Out of memory while trying to allocate 53061779456 bytes.
2023-01-11 09:31:32,348 Done

---- example of fasta file

>7JLS_1|Chain A|Probable periplasmic dipeptide-binding lipoprotein DppA|Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) (83332)
MAGLNDIFEAQKIEWHELEVLFQGPMSPDVVLVNGGEPPNPLIPTGTNDSNGGRIIDRLFAGLMSYDAVGKPSLEVAQSIESADNVNYRITVKPGWKFTDGSPVTAHSFVDAWNYGALSTNAQLQQHFFSPIEGFDDVAGAPGDKSRTTMSGLRVVNDLEFTVRLKAPTIDFTLRLGHSSFYPLPDSAFRDMAAFGRNPIGNGPYKLADGPAGPAWEHNVRIDLVPNPDYHGNRKPRNKGLRFEFYANLDTAYADLLSGNLDVLDTIPPSALTVYQRDLGDHATSGPAAINQTLDTPLRLPHFGGEEGRLRRLALSAAINRPQICQQIFAGTRSPARDFTARSLPGFDPNLPGNEVLDYDPQRARRLWAQADAISPWSGRYAIAYNADAGHRDWVDAVANSIKNVLGIDAVAAPQPTFAGFRTQITNRAIDSAFRAGWRGDYPSMIEFLAPLFTAGAGSNDVGYINPEFDAALAAAEAAPTLTESHELVNDAQRILFHDMPVVPLWDYISVVGWSSQVSNVTVTWNGLPDYENIVKAENLYFQGGHHHHHHHH
>7JLS_2|Chain B|Peptide SER-VAL-ALA|Escherichia coli BL21(DE3) (469008)
SVA

There are two chains, but the log warns that only the first one is used (batch mode for multimers is something else I still have to look into), so it should run fine as a monomer, especially with proteins of 500 to 550 residues.

The first thing I could think of that might cause this is the training = True flag triggering computation of all the gradients for backprop, but my understanding is that it's only supposed to activate the dropout layers?
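As a side note, that intuition matches how dropout works in JAX in general: it is just a random mask applied in the forward pass, and gradients only exist if they are explicitly requested. A minimal sketch (illustrative only, not AlphaFold's own dropout code):

```python
import jax
import jax.numpy as jnp

def dropout(rng, x, rate, is_training):
    # Forward-pass-only operation: randomly zero activations and rescale the rest.
    if not is_training or rate == 0.0:
        return x
    keep = jax.random.bernoulli(rng, 1.0 - rate, x.shape)
    return jnp.where(keep, x / (1.0 - rate), 0.0)

x = jnp.ones((4, 4))
y = dropout(jax.random.PRNGKey(0), x, rate=0.15, is_training=True)
# Gradients are only computed when explicitly asked for, e.g. jax.grad(loss_fn),
# so a training/dropout flag does not by itself trigger any backprop.
```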

sokrypton commented 1 year ago

By default, is_training=True also disables the memory saving components of the model.

I guess one could go through and hard-code these to True across the code, or introduce a low_memory option into the system.

We just need to confirm the low_memory option isn't interfering with how dropouts are distributed...
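Roughly, the coupling looks like the following simplified sketch (assumed names, not the exact AlphaFold source): the big attention calls are wrapped in a sub-batching helper whose low_memory argument is set to not is_training, so turning dropout on via is_training also turns chunked inference off.

```python
import jax.numpy as jnp

def chunked_attention(attn_fn, inputs, subbatch_size, low_memory):
    # Memory-saving path: process the leading dimension in chunks so the full
    # attention tensors are never materialised at once.
    if not low_memory:
        return attn_fn(inputs)          # one giant call -> huge peak memory
    chunks = [attn_fn(inputs[i:i + subbatch_size])
              for i in range(0, inputs.shape[0], subbatch_size)]
    return jnp.concatenate(chunks, axis=0)

def evoformer_attention(attn_fn, inputs, subbatch_size, is_training):
    # The problematic coupling: low_memory is simply `not is_training`, so
    # asking for dropout (is_training=True) silently disables sub-batching.
    return chunked_attention(attn_fn, inputs, subbatch_size,
                             low_memory=not is_training)
```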


yoann-glanum commented 1 year ago

Oh so that's the intended / normal way of functioning then?

I was just raising the issue because I thought it might be a recent bug or something similar ^^

sokrypton commented 1 year ago

Yes, this is intended. But I'm not sure whether disabling low_memory (which is what is_training does) is only needed for proper gradient computation (which we don't need here) or also for proper dropout distribution. If it isn't needed for the latter, we can just keep low_memory set to True!


yoann-glanum commented 1 year ago

I see I see,

It seems wild to me that the low_memory flag makes the difference between the 6-7 GB of GPU RAM used in the default run on these sequences and the 53 GB requested here; that's a beefy optimisation.

I was assuming the is_training flag would just affect the layers used during inference, and that gradient computation would be a fully separate step that never gets activated.

Unless dropout needs a very particular kind of randomness for its activation, I don't see these layers requiring any more compute than an ordinary sampling from a probability distribution?

(I don't particularly need that option for what I'm doing I'm just trying to test everything, but it's good to know there'd be a way to make it work)

sokrypton commented 1 year ago

AlphaFold v2.3.1 fixes this issue by using the global_config to enable dropouts instead of is_training (which also disables the low-memory path). See the commit here: https://github.com/deepmind/alphafold/commit/f96e254e6da73daba42db4d38727f71c2f6fcee5

In the colabfold beta branch, I've switched to this setup to enable dropouts. Try it out: https://colab.research.google.com/github/sokrypton/ColabFold/blob/beta/AlphaFold2.ipynb
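For reference, a rough sketch of what that setup looks like on the AlphaFold side. The eval_dropout field name reflects my reading of the v2.3.1 config and should be treated as an assumption, and the weights path is a placeholder:

```python
from alphafold.model import config, data, model

# Enable dropout through the global config instead of is_training, so the
# low-memory (sub-batched) inference path stays active.
cfg = config.model_config("model_3_ptm")
cfg.model.global_config.eval_dropout = True   # assumed v2.3.1 field name

params = data.get_model_haiku_params(model_name="model_3_ptm",
                                     data_dir="/path/to/params")  # placeholder
runner = model.RunModel(cfg, params)
# runner.predict(...) now samples dropout per seed without the 50+ GB allocation.
```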

yoann-glanum commented 1 year ago

(Sorry I had to be off work for a few days)

Tested it on the v3 notebook; it does seem to run fine now, which is great for more in-depth exploration of conformations (when runtimes allow it 😬).

I've also tried out some of the other new options (the automatic 20 recycles with 0.5 recycle tolerance, 2 seeds, save_all) and had a few thoughts:

Boy do I have a few things to catch up to in my colab / fork

sokrypton commented 1 year ago

Thanks for the report! The save_all option should be saving to the output zip; I'll fix this! The tolerance is a metric introduced by DeepMind in v2.3.1, so we don't really want to mess with it at the moment. What it does is monitor the RMSD between recycles and terminate as soon as the model converges. stop_at also does this, but it additionally terminates when any of the models hits a specific score. Maybe this should be an option...
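A rough sketch of how the two stopping criteria relate (illustrative pseudocode with placeholder names, not ColabFold's implementation; in ColabFold the score check is applied across models rather than inside the recycle loop):

```python
import numpy as np

def predict_with_early_stop(run_one_recycle, max_recycles=20,
                            recycle_tolerance=0.5, stop_at_score=None):
    """run_one_recycle is a placeholder callable returning (CA coords, score)."""
    prev_ca, result = None, None
    for r in range(max_recycles + 1):
        ca, score = run_one_recycle(r)
        result = (ca, score)
        # Recycle tolerance: stop once the structure stops moving between recycles.
        if prev_ca is not None:
            rmsd = np.sqrt(((ca - prev_ca) ** 2).sum(axis=-1).mean())
            if rmsd < recycle_tolerance:
                break
        # stop_at-style criterion: stop once the score is already good enough.
        if stop_at_score is not None and score >= stop_at_score:
            break
        prev_ca = ca
    return result
```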