Closed (guanzhchen closed this 7 months ago)
Are parameters optimized (backward) for each chunk rather than the whole long sequence?
The whole sequence.
Have you checked whether there are any approximation errors, or whether the optimization is length-agnostic?
That makes sense! Thank you!
Thanks for your exciting work!
I found that the extract_local function seems to split an input sequence of length L into chunks of length L/world_size. Are parameters optimized (backward) for each chunk rather than the whole long sequence? So, have you checked whether there are any approximation errors, or whether the optimization is length-agnostic?
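To make the question concrete, here is a toy sketch of why splitting a sequence across ranks need not change the optimization target. It is not the repository's code: split_local and grad_w are hypothetical helpers, the split is assumed to be contiguous (the real extract_local may partition differently), and the loss is a simple per-token sum. Cross-chunk dependencies such as attention are not modeled here and would need communication between ranks.

```python
def split_local(tokens, rank, world_size):
    """Hypothetical helper: return the contiguous chunk of `tokens` owned by `rank`."""
    chunk = len(tokens) // world_size  # assumes len(tokens) is divisible by world_size
    return tokens[rank * chunk:(rank + 1) * chunk]


def grad_w(w, xs, ys):
    """Gradient of sum_t (w * x_t - y_t)**2 with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


world_size = 4
xs = list(range(1, 9))        # toy "sequence" of length L = 8
ys = [3 * x + 1 for x in xs]  # toy targets
w = 2                         # single shared parameter

# Each rank computes a gradient on its own chunk of length L / world_size.
local_grads = [
    grad_w(w, split_local(xs, r, world_size), split_local(ys, r, world_size))
    for r in range(world_size)
]

# Summing the per-rank gradients (the role an all-reduce would play)
# gives the same value as the gradient over the full sequence.
full_grad = grad_w(w, xs, ys)
print(sum(local_grads) == full_grad)
```

In this toy setting the summed per-chunk gradients match the full-sequence gradient exactly, because the loss is additive over tokens. Whether the same holds in a real model depends on how cross-chunk terms are computed.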