lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

RelativePositionBias in CoarseTransformer #129

Closed cyanbx closed 1 year ago

cyanbx commented 1 year ago

Hi. It seems that the RelativePositionBias in CoarseTransformer limits the sequence length at inference time: it cannot embed an integer relative position larger than the maximum it sees during training, and therefore cannot handle sequences longer than the max training length. Is there any alternative positional embedding method?
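To make the concern concrete, here is a minimal, hypothetical sketch of a table-based relative position bias (not necessarily the module actually used in this repo, see the reply below): offsets beyond the trained range have no table entry of their own and must be clamped, so longer sequences gain no new positional information.

```python
import torch
from torch import nn

# hypothetical sketch of the concern: a relative position bias realised as a
# learned lookup table over integer offsets. offsets outside the range covered
# at training time fall off the table and have to be clamped into the last
# entries, so the bias carries no new information past the training length.
class LearnedRelativeBias(nn.Module):
    def __init__(self, max_len, heads):
        super().__init__()
        self.max_len = max_len
        # one learned scalar per head for every offset in [-(max_len-1), max_len-1]
        self.bias = nn.Embedding(2 * max_len - 1, heads)

    def forward(self, seq_len):
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                       # (seq_len, seq_len) signed offsets
        rel = rel.clamp(-(self.max_len - 1), self.max_len - 1)  # offsets beyond training length collapse
        return self.bias(rel + self.max_len - 1)                # (seq_len, seq_len, heads)
```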

lucidrains commented 1 year ago

it is actually the best positional bias for length extrapolation of anything in the literature. the bias is parameterized as a continuous function by a small mlp
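For reference, a minimal sketch of such a continuous, MLP-parameterized relative position bias; the class name and layer sizes here are illustrative and not the exact modules in audiolm-pytorch:

```python
import torch
from torch import nn

# sketch of a continuous (MLP-parameterised) relative position bias, in the
# spirit of what is described above; dimensions are illustrative only.
class ContinuousPositionBias(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, dim),
            nn.SiLU(),
            nn.Linear(dim, heads)
        )

    def forward(self, seq_len):
        pos = torch.arange(seq_len, dtype=torch.float)
        rel = (pos[None, :] - pos[:, None]).unsqueeze(-1)  # (i, j, 1) signed distances
        # the mlp accepts any real-valued distance, so unseen (longer) distances
        # still produce a bias - how well it extrapolates is another matter
        return self.mlp(rel)                               # (i, j, heads), added to attention logits
```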

cyanbx commented 1 year ago

> it is actually the best positional bias for length extrapolation of anything in the literature. the bias is parameterized as a continuous function by a small mlp

But how can the mlp embed a relative position larger than any it has seen during training? I have actually encountered quality degradation when generating audio longer than the max length used during training.

lucidrains commented 1 year ago

it won't extrapolate to arbitrary lengths; usually up to 4x in language modeling

if you need greater lengths, recommend fine tuning at the end

lucidrains commented 1 year ago

it represents the positions as a continuous function. recommend reading about NeRFs and implicit neural representations

lucidrains commented 1 year ago

@cyanbx could be a good research topic, if you are looking to get a graduate degree, just saying :)

lzl1456 commented 1 year ago

@lucidrains Is it related to the relative positional bias scheme used?

ALiBi seems to be more effective for extrapolating to long sequences. [attached image]

lucidrains commented 1 year ago

@lzl1456 imo alibi has a flaw where it restricts the attention to be too local

but i could eventually offer it as an option, yes, if one doesn't care about global coherence. it probably extrapolates more reliably past 3-4x the training sequence length
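For comparison, a hedged sketch of the ALiBi bias for causal attention (the function name is made up here; the slope formula follows the ALiBi paper's geometric sequence). The penalty grows linearly with distance, which is what drives the locality concern above:

```python
import torch

# sketch of the ALiBi additive bias: each head gets a fixed slope, and the
# attention logit is penalised linearly in the distance to the query, so
# far-away tokens become increasingly hard to attend to.
def alibi_bias(seq_len, heads):
    # geometric slope sequence per head, as in the ALiBi paper
    slopes = torch.tensor([2 ** (-8 * (h + 1) / heads) for h in range(heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).clamp(max=0)  # j - i, capped at 0 (future positions get masked anyway)
    return slopes[:, None, None] * dist[None, :, :]    # (heads, i, j) bias, more negative with distance
```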