microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

LongNet Code Release #37

Closed arnavdantuluri closed 10 months ago

arnavdantuluri commented 1 year ago

Hello all, just wondering whether there is an ETA on the official release of the LongNet code. It was mentioned in https://github.com/microsoft/unilm/issues/1182#issuecomment-1624938095 that the LongNet code would be released as part of torchscale. Looking forward to seeing the official implementation!

JacksonZ03 commented 1 year ago

I sent an email to one of the researchers. Hopefully, I get a reply. This paper is honestly a giant step forward.

arnavdantuluri commented 1 year ago

Please let us know if you get a reply 🙂

LiutongZhou commented 1 year ago

Looking forward to it

pokameng commented 1 year ago

Looking forward to the code release!

AlexanderChen1989 commented 1 year ago

Please release the code!

CanyonWind commented 1 year ago

Would appreciate it a lot if you could give the community a heads-up on what the ETA might be.

fkodom commented 1 year ago

For anyone that's interested, I built an (unofficial) LongNet implementation here: https://github.com/fkodom/dilated-attention-pytorch

There are no pretrained weights -- I don't have the personal compute budget for that. 😂 But the main concepts are there, and I reproduced scaled-down versions of the inference benchmarks.

Hopefully it's interesting for tinkering and exploration, at least until the official code comes out. 😉
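
For readers who just want to see the core idea while the official release is pending, below is a minimal, illustrative sketch of a single (segment length, dilation rate) branch of dilated attention as described in the LongNet paper, using PyTorch's `scaled_dot_product_attention`. The function name and tensor shapes are my own assumptions, not code from fkodom's repo or from torchscale, and it omits the mixing of multiple branches and the per-head dilation offsets used in the paper.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_length, dilation_rate):
    """One (segment_length, dilation_rate) branch of dilated attention.

    q, k, v: (batch, seq_len, num_heads, head_dim). For simplicity this
    assumes seq_len is a multiple of segment_length and segment_length is a
    multiple of dilation_rate. Illustrative sketch only.
    """
    b, n, h, d = q.shape
    w, r = segment_length, dilation_rate

    # 1) Split the sequence into non-overlapping segments of length w,
    #    then keep every r-th token inside each segment (sparsification).
    def sparsify(x):
        x = x.reshape(b, n // w, w, h, d)
        return x[:, :, ::r]                      # (b, segments, w // r, h, d)

    qs, ks, vs = (sparsify(x) for x in (q, k, v))

    # 2) Dense attention inside each sparsified segment: fold segments into
    #    the batch dimension and move heads in front for SDPA.
    def to_sdpa(x):
        return x.flatten(0, 1).transpose(1, 2)   # (b * segments, h, w // r, d)

    out = F.scaled_dot_product_attention(to_sdpa(qs), to_sdpa(ks), to_sdpa(vs))

    # 3) Scatter the attended tokens back to their original positions; the
    #    positions dropped by the dilation stay zero in this branch.
    out = out.transpose(1, 2).reshape(b, n // w, w // r, h, d)
    full = torch.zeros(b, n // w, w, h, d, dtype=out.dtype, device=out.device)
    full[:, :, ::r] = out
    return full.reshape(b, n, h, d)
```

In the paper, several such branches with different (segment length, dilation rate) pairs are computed and their outputs combined, which is why the cost grows roughly linearly in sequence length: each token only attends within its own sparsified segment.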

MHarris021 commented 1 year ago

Any updates on when this might officially come out? @fkodom's repo https://github.com/fkodom/dilated-attention-pytorch is a pretty good implementation while we wait. @DeepDream2045 and I benchmarked it processing 64 million tokens on an RTX A5000 in linear time, and it scaled up nicely to handle 256 million tokens on a single A100 as well (both with embed_dim=8 and num_heads=4). We also validated that the MultiheadDilatedAttention class could handle 32 million tokens with embed_dim=128 and num_heads=8 on an A100. Memory and runtime appear to scale linearly in both token count and embed_dim, while increasing the number of heads increases runtime linearly. Finally, I plotted the results for variations in embed_dim and num_heads at 4 million tokens on an A100, starting at embed_dim=1024 with 32 heads and working down to embed_dim=32 with 4 heads.

I've modified the benchmark.py file to make benchmarking easier here: https://github.com/DarcStar-Solutions-Tech/dilated-attention-pytorch

[Attached benchmark plots: benchmark-256M-tokens-2023-08-15; tokens-32M-embed_dim-128-heads-8; tokens-1M-embed_dim-1024-heads-32; tokens-4M-embed_dim-1024-heads-32]
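
For anyone who wants to reproduce this kind of scaling curve without the modified benchmark.py, a small timing loop like the sketch below is enough. It is illustrative only: `attn` stands for whatever attention module you are benchmarking (for example a multi-head dilated-attention layer), and that layer's constructor arguments (embed_dim, num_heads, segment lengths, dilation rates) depend on the implementation you use.

```python
import time
import torch

def benchmark(attn, seq_lengths, embed_dim, device="cuda", dtype=torch.float16):
    """Measure wall time and peak GPU memory of `attn` across sequence lengths.

    `attn` is any module mapping (batch, seq_len, embed_dim) -> same shape.
    Illustrative sketch, not the repo's benchmark.py.
    """
    results = []
    for n in seq_lengths:
        x = torch.randn(1, n, embed_dim, device=device, dtype=dtype)
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        with torch.no_grad():
            attn(x)                                  # forward pass only
        torch.cuda.synchronize(device)
        elapsed = time.perf_counter() - start
        peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
        results.append((n, elapsed, peak_gb))
        print(f"tokens={n:>12,d}  time={elapsed:8.3f}s  peak_mem={peak_gb:6.2f} GiB")
    return results
```

Plotting the recorded time and peak memory against token count gives the same linear-scaling picture described above.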

agemagician commented 1 year ago

@shumingma @gitnlp @sunyt32 @donglixp @buaahsh @microsoftopensource

Any update regarding the release date for the code?

I am currently working on a project with Google, and I am interested in benchmarking your new architecture.

agemagician commented 11 months ago

Your response is highly appreciated @shumingma @gitnlp @sunyt32 @donglixp @buaahsh @microsoftopensource

agemagician commented 11 months ago

Thanks @donglixp for assigning @shumingma to the model release.

@shumingma, any estimate of when the model code will be released?

agemagician commented 10 months ago

@shumingma and @donglixp it would be great if you could share the timeline for all of us.

shumingma commented 10 months ago

Hi all, we have finally released the code of LongNet (together with LongViT). Thank you for your patience. Have fun!