Open wheresmyhair opened 2 weeks ago
**Reply:** The `del obj` statement doesn't work because other variables are still pointing to the tensors, most likely the linked `hp_params`.
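To make the reference-counting point concrete, here is a minimal sketch (with a stand-in object, not DeepSpeed's actual tensors or `hp_params`): `del` only removes one name binding, and the underlying object survives as long as any other reference, such as a linked parameter list, still holds it.

```python
import weakref

class FakeTensor:
    """Stand-in for a parameter tensor (hypothetical, for illustration)."""
    pass

obj = FakeTensor()
hp_params = [obj]          # a second reference, like the linked hp_params
ref = weakref.ref(obj)     # lets us observe whether the object is still alive

del obj                    # removes one name binding, not the object itself
print(ref() is not None)   # True: hp_params still keeps the object alive

hp_params.clear()          # drop the remaining reference
print(ref() is None)       # True: now the object is actually collected
```

The same applies to GPU tensors: memory is only released once every Python reference to the tensor is gone.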
**Reply:** @wheresmyhair, we have put some effort into enabling better memory management. Please see the following links for relevance to your scenario:
**Original post:** I'm developing a PEFT algorithm. Basically it does the following: say the training process has 30 steps in total, and the set of trainable modules switches over the course of training:

- `lmhead` + `layer_0`
- `lmhead` + `layer_1`
- `lmhead` + `layer_0`
The key point is that, after a switch, the optimizer states of `lmhead` are expected to be kept, while the states of the body layers should be deleted. For example, the `step` counter in the `lmhead` state should go from 0 to 29, while `step` for the body layers should count from 0 again after every switch, even if the layer has been selected before. In this case, the parameter groups look like:

The first two represent layers whose states should be kept, while the last two will change.
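For illustration, a hedged sketch of what such a split might look like (the group names and the `keep_state` flag are hypothetical, not the actual parameter groups from the model):

```python
# Hypothetical parameter groups: the first two (lmhead) keep their
# optimizer state across switches, the last two (the currently selected
# body layer) are swapped out and their state reset.
param_groups = [
    {"name": "lmhead.weight",  "keep_state": True},
    {"name": "lmhead.bias",    "keep_state": True},
    {"name": "layer_0.weight", "keep_state": False},
    {"name": "layer_0.bias",   "keep_state": False},
]

kept = [g["name"] for g in param_groups if g["keep_state"]]
print(kept)  # ['lmhead.weight', 'lmhead.bias']
```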
An approach I came up with is to partially "re-init" the optimizer at the beginning of the step that performs the switch. I modified my Hugging Face trainer based on the DeepSpeed optimizer's `__init__` method:

However, I found that `del obj` does not work, as the memory profiling result below shows:

I noticed that the tensors the arrows point at spawn when:
Are there any insights?
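One alternative to re-initializing the optimizer, sketched below with plain dicts standing in for a `torch.optim`-style `state` mapping (an assumption about the layout, not DeepSpeed's actual internals): mutate the existing state in place and drop only the body-layer entries, so no stale references to the old tensors are left behind.

```python
# Minimal sketch: per-parameter optimizer state keyed by parameter name.
# In real PyTorch/DeepSpeed the keys are parameter objects, not strings.

def reset_body_layer_state(state, body_params):
    """Drop optimizer state for the body-layer params so `step`,
    `exp_avg`, etc. restart from scratch after a switch."""
    for p in body_params:
        state.pop(p, None)  # mutate in place; nothing keeps the old entries alive

state = {
    "lmhead.weight":  {"step": 12},
    "layer_0.weight": {"step": 12},
}
reset_body_layer_state(state, ["layer_0.weight"])
print(state)  # {'lmhead.weight': {'step': 12}}
```

Because the mapping itself is reused rather than rebuilt, the `lmhead` state (including its `step` counter) survives the switch untouched.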