openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

tokenization issue for code #61

Open brando90 opened 1 year ago

brando90 commented 1 year ago

Is this still a bug for tokenization? I want to use this model for code. Thanks!

gjmulder commented 1 year ago

If you are talking about the fast tokenizer, it was fixed on the main branch of transformers. AFAIK it hasn't been tagged in a release yet.
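
In the meantime, a minimal workaround sketch (the `use_fast=False` route is the one the OpenLLaMA README recommends; the repo id here is the v1 7B checkpoint) is to force the slow SentencePiece tokenizer, or to install transformers from source to pick up the fix:

```python
from transformers import AutoTokenizer

# Until the fix lands in a tagged release, either install transformers
# from source (pip install git+https://github.com/huggingface/transformers.git)
# or avoid the fast tokenizer entirely by forcing the slow one:
tokenizer = AutoTokenizer.from_pretrained(
    "openlm-research/open_llama_7b", use_fast=False
)

snippet = "def f(x):\n    return x"
print(tokenizer.tokenize(snippet))
```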

gjmulder commented 1 year ago

Probably a duplicate of #40?

young-geng commented 1 year ago

Check out our OpenLLaMA v2 model (https://huggingface.co/openlm-research/open_llama_7b_v2), which has a new tokenizer and is pretrained on a lot of code. The official release will happen very soon.
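
For anyone who wants to try it, here is a minimal loading sketch (it mirrors the usage pattern in the OpenLLaMA README; the prompt and generation settings are illustrative, and it assumes torch and accelerate are installed):

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id = "openlm-research/open_llama_7b_v2"

# v2 ships its own tokenizer, so load both from the same repo.
tokenizer = LlamaTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```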

brando90 commented 1 year ago

Can we use the old models, or how does this work? Do we just load the old model with the new tokenizer?

young-geng commented 1 year ago

@brando90 The v2 model is a completely different model trained on a new data mixture, so you'll need to load the new weights too.
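
In other words, the weights and the tokenizer must come from the same checkpoint. Below is a rough diagnostic sketch (the round-trip check is my own illustration, not an official test) for comparing how each tokenizer handles the whitespace that code depends on:

```python
from transformers import AutoTokenizer

# Compare how the v1 and v2 tokenizers handle consecutive whitespace.
# The v1 tokenizer was reported to merge repeated spaces, which is one
# reason v2 ships a new tokenizer for code.
snippet = "def f(x):\n    return  x"  # indent plus a double space

for repo in ("openlm-research/open_llama_7b",
             "openlm-research/open_llama_7b_v2"):
    tok = AutoTokenizer.from_pretrained(repo, use_fast=False)
    ids = tok(snippet, add_special_tokens=False).input_ids
    print(repo)
    print("  tokens:     ", tok.convert_ids_to_tokens(ids))
    print("  round-trips:", tok.decode(ids) == snippet)
```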

brando90 commented 1 year ago

Got it, thanks! I will assume OpenLLaMA v1 is basically unusable for code generation (my use case) and use only v2.

young-geng commented 1 year ago

@brando90 Yeah, I imagine you'll want to use v2 in almost all cases, since it is the better model overall.