Closed joecummings closed 3 days ago
As of commit cdf5cdfcc4c5ade680e9229ffd02c86cf7891599 with merge base abe798d5f7af7761fcf3064b42fb699c7ef19fcd: :green_heart: Looks good so far! There are no failures yet. :green_heart:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi Joe, Thank you for your work! I have been trying to do something similar. I'm just reading through your changes. Is the purpose of 'on-the-fly' packing to reduce overall memory overhead by generating the attn-mask on the fly instead of during _add_pack?
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 66.72%. Comparing base (abe798d) to head (60d19ab). Report is 5 commits behind head on main.
> Is the purpose of 'on-the-fly' packing to reduce overall memory overhead by generating the attn-mask on the fly instead of during `_add_pack`?
Yep, constructing the mask during training reduces memory by about 99% and only slightly slows down processing.
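For intuition, here is a minimal sketch (not the PR's actual code) of how a block-causal mask can be built on the fly from the per-sample sequence lengths stored for each pack; the helper name `build_pack_mask` is illustrative:

```python
import torch

def build_pack_mask(seq_lens: list[int]) -> torch.Tensor:
    """Build a block-causal attention mask for one pack on the fly.

    Each sample in the pack may only attend to earlier tokens of the
    same sample, so the mask is block-diagonal with causal blocks.
    """
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        # lower-triangular (causal) block covering this sample only
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        start += n
    return mask
```

Because only the small `seq_lens` list is stored offline and the O(max_seq_len²) boolean mask is materialized per access, the offline footprint shrinks dramatically, at the cost of a little compute in the dataloader.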
Also, how difficult would it be to also move the `input_pos` creation to `__getitem__`? Probably not as significant a memory saver as the mask, but it might still be worthwhile.
Great question! I think I actually could do this with just the `seq_lens` information, but it would entail generating and concatenating multiple tensor arrays during `__getitem__`, which would slow down processing and only save a little memory. I could run some tests to confirm the tradeoff, though.
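Concretely, the concatenation in question might look something like this (a hypothetical sketch, not the PR's code): positions restart at 0 for each sample, so `input_pos` is a concat of one `arange` per sample in the pack.

```python
import torch

def build_input_pos(seq_lens: list[int]) -> torch.Tensor:
    # Position ids restart at 0 for every sample in the pack,
    # so we concatenate one arange per sample.
    return torch.cat([torch.arange(n) for n in seq_lens])
```

For example, `build_input_pos([3, 2])` yields positions `[0, 1, 2, 0, 1]`. Doing this on every `__getitem__` trades a small amount of per-access compute for a small amount of offline memory, which is the tradeoff discussed above.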
Context
As investigated in #1097, the offline approach to constructing the mask consumed waaaaaay too much memory. Therefore, this approach constructs tokens, labels, and input_pos offline and then constructs the mask during access (training). For a max_seq_len of 4096 (default for many models), we can expect the memory of a single pack to look like the following offline:
To provide a real-world example, let's use the Web Instruct Dataset from Tiger Labs. It comes in at 3.51 GB with 2.3 million samples. The average sample length (with instruct template applied) is about 100 tokens. This means that 40 samples fit in each pack if we don't split samples across packs, so we can expect about 57,500 packs. That number times 0.1MB is 5.75GB of additional memory, bringing the total on-disk memory needed to load this dataset (before training) to 9.26GB, well within reasonable bounds.
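The arithmetic above can be checked with a quick back-of-the-envelope script (all inputs come from the paragraph; `pack_mb` is the ~0.1MB-per-pack figure quoted earlier):

```python
# Back-of-the-envelope check of the Web Instruct numbers above.
dataset_gb = 3.51          # on-disk size of the dataset
num_samples = 2_300_000
avg_sample_len = 100       # tokens, with instruct template applied
max_seq_len = 4096

samples_per_pack = max_seq_len // avg_sample_len   # 40 samples per pack
num_packs = num_samples // samples_per_pack        # ~57,500 packs
pack_mb = 0.1                                      # offline memory per pack, from above
extra_gb = num_packs * pack_mb / 1000              # ~5.75 GB additional
total_gb = dataset_gb + extra_gb                   # ~9.26 GB before training
print(num_packs, round(extra_gb, 2), round(total_gb, 2))
```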
**Why do we need `seq_lens`?** Technically we could calculate this using the `input_pos`, but that would save us negligible memory and increase processing time during training, which is undesirable.

**Why are you using this dataset?** It's a large dataset, downloaded 33,026 times in the last month. As good a baseline as any.
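To make that tradeoff concrete: since positions restart at 0 at every sample boundary, the sequence lengths could in principle be recovered from `input_pos` on every access, roughly like this hypothetical sketch; storing `seq_lens` offline skips this per-access work for a negligible memory cost.

```python
import torch

def seq_lens_from_input_pos(input_pos: torch.Tensor) -> list[int]:
    # Each sample's positions restart at 0, so zeros mark sample starts.
    starts = (input_pos == 0).nonzero().flatten().tolist()
    ends = starts[1:] + [len(input_pos)]
    return [end - start for start, end in zip(starts, ends)]
```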
**Why did you update the signature to take in a `padding_idx` and hardcode `CROSS_ENTROPY_IGNORE_IDX`?** Excellent question. Before, the packed dataset assumed that `padding_idx = 0` and always used `CROSS_ENTROPY_IGNORE_IDX`. The former is NOT an assumption we can make, so it should actually be configurable; the latter IS a reasonable assumption, so we should just hardcode it instead of defaulting the param (which wouldn't get used anyway).
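A rough sketch of the padding logic this implies (the function name and signature here are illustrative, not the PR's actual code; `-100` is PyTorch's default `ignore_index` for cross-entropy loss):

```python
CROSS_ENTROPY_IGNORE_IDX = -100  # default ignore_index for F.cross_entropy

def pad_pack(tokens: list[int], labels: list[int],
             max_seq_len: int, padding_idx: int):
    # Tokens are padded with the tokenizer-specific padding_idx
    # (configurable, since not every tokenizer uses 0), while labels
    # are always padded with the ignore index so padding never
    # contributes to the loss -- hence it can be hardcoded.
    n_pad = max_seq_len - len(tokens)
    return (tokens + [padding_idx] * n_pad,
            labels + [CROSS_ENTROPY_IGNORE_IDX] * n_pad)
```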
Changelog
- Added `PACK_TYPE` alias so I don't have to keep typing it
- Added `seq_lens` to the offline variables to hold information on each seq len in the pack (useful for calculating the mask on the fly)

Test plan
All are passing
Using this gist: https://gist.github.com/joecummings/05586af0a08eef0714c7da3c56ee7365
Only packing 1% of the dataset, which is 23k samples. Using our calculation from above, we expect memory usage with the new implementation to take an additional 0.58GB.
Our calculation looks pretty spot on for how much memory the new implementation should take. And it makes sense that there would be a little more memory used when the mask is constructed during dataloading.
Not surprising that the old packing takes much longer than the new implementation.
Why do we need to do this? Well, the above "loading" is not a true measurement of how packing a dataset will affect the training time. For instance, we are now passing a constructed mask into attention instead of relying on SDPA to construct one for us.
CMD:
YAY, it's (kinda) just as fast as the old implementation with waaaaaay less memory.