tanreinama / GPTSAN

General-purpose Switch Transformer based Japanese language model
MIT License

HuggingFace integration #2

Closed younesbelkada closed 1 year ago

younesbelkada commented 1 year ago

Dear authors,

I am very excited about the release of this model! Do you plan to integrate the model into Hugging Face transformers? We already support Switch Transformers, so adding the model could be easier than expected. I can of course guide you through the exact steps of integrating the model.

Thank you in advance!

tanreinama commented 1 year ago

Wow, thank you, great suggestion.

Integrating with Hugging Face is something I've always wanted to do. However, I'm not good at English, so I was having trouble deciphering the documentation. To be honest, I need some help. In particular, GPTSAN uses its own text encoding, and I didn't understand how to port it to Hugging Face. If you are knowledgeable about porting and can support me, I would love to get to work. I will always be grateful to Hugging Face.

Thank you.

younesbelkada commented 1 year ago

Thank you for your message @tanreinama. I can guide you through the process step by step. As a first step, I would suggest forking the transformers repo and creating a new branch.

1- After that, create a virtual environment and install transformers locally by running pip install -e ".[dev]"
2- Create a new model skeleton by running transformers-cli add-new-model-like and copy the model from switch_transformers
3- Regarding the text encoding (tokenizer), I see that this tokenizer is similar to the one used by Japanese GPT-NeoX (https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox_japanese/tokenization_gpt_neox_japanese.py). A good starting point could be to import this tokenizer and compare it with yours to see if they produce the same results:

from transformers import AutoTokenizer

tokenizer_gpt_neox = AutoTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")
...
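
For illustration, a rough sketch of such a comparison, reusing the SWEEncoder_ja encoder and vocabulary files from this repository (file paths are assumed to sit in the working directory), could look like this:

import json
from transformers import AutoTokenizer
from encode_swe import SWEEncoder_ja  # original encoder from this repository

# load both tokenizers
tokenizer_gpt_neox = AutoTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")
with open("ja-swe36k.txt") as f:
    bpe = f.read().split("\n")
with open("emoji.json") as f:
    emoji = json.loads(f.read())
enc = SWEEncoder_ja(bpe, emoji)

# encode the same sentence with both and compare the resulting token ids
sample = "武田信玄は、"
print(tokenizer_gpt_neox.encode(sample))
print(enc.encode(sample))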

Let me know what you think and if you have any questions!

tanreinama commented 1 year ago

The encoder seems to be usable as-is from gpt_neox_japanese. I created a local branch per your instructions and created the model. I ported some layers so I could reproduce the model structure, changed the configuration file, and removed the parameters I don't need. What should I do next? I need to convert it to a PyTorch model file and create the sentence generation code.

By the way, there is a place in the switch_transformers code that bothers me. At the point where Top1Router adds jitter to the input, this should only happen during training, but I can't find the code that disables it during prediction...

younesbelkada commented 1 year ago

This is great, thank you so much for updating me on the current progress! I am very excited about that! I think for now you can just hardcode config.router_jitter_noise to 0 (manually set it to 0 in your new configuration file).
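
For example, a minimal sketch of what that could look like (assuming the new configuration keeps the router_jitter_noise field copied over from SwitchTransformers):

from transformers import SwitchTransformersConfig

# hard-code the Top1Router jitter noise to 0 so no jitter is added at prediction time;
# the same field would be set to 0 in the new model's configuration file
conf = SwitchTransformersConfig(router_jitter_noise=0.0)
print(conf.router_jitter_noise)  # 0.0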

Yes, the next step is to convert your model to PyTorch and check that you get the same generations! Let me know how I can help next, Younes

tanreinama commented 1 year ago

The PyTorch porting and the generation logic are complete. Here is where I am now:

import torch

# import directly from the new model directory
from transformers.models.gptsan_japanese import modeling_gptsan_japanese, configuration_gptsan_japanese

# create model
conf = configuration_gptsan_japanese.GPTSANJapaneseConfig()
model = modeling_gptsan_japanese.GPTSANJapaneseModel(conf)
model.load_state_dict(torch.load("GPTSAN-2.8B-spout_is_uniform.pt"))

# model run
model(input_ids=torch.zeros([1,200]).int())
> CausalLMOutputWithPast(loss=None, logits=tensor([[[ -2.4805,  -6.7204,  -6.4591,  ...,  -4.8195, -19.6129, -16.9775],
>          [ -3.5968,  -7.8980,  -6.6534,  ...,  -5.5144, -19.7407, -17.9312],
>          [ -4.5876,  -7.7133,  -6.7145,  ...,  -6.6660, -20.1992, -18.0129],
>          ...,
>          [ -1.7173,  -8.6070,  -6.2956,  ...,  -6.6117, -21.1183, -19.6286],
>          [ -2.2849,  -8.7263,  -6.5163,  ...,  -6.7639, -21.3613, -19.8236],
>          [ -2.0467,  -8.6484,  -6.4396,  ...,  -6.6144, -21.0991, -19.6714]]],
>        grad_fn=<AddBackward0>), past_key_values=None, hidden_states=None, attentions=None)

# original encoder
from encode_swe import SWEEncoder_ja
import json
with open('ja-swe36k.txt') as f:
    bpe = f.read().split('\n')
with open('emoji.json') as f:
    emoji = json.loads(f.read())
enc = SWEEncoder_ja(bpe, emoji)

# sentence generation
x_tok = enc.encode("武田信玄は、")
model = model.cuda()
res = model.generator.generate_lm(x_tok)
print(enc.decode(res[0]))
> 自ら戦場を訪ね、その果て、自負を持つ将に太刀打ち、自らと敵対するために、幾度となく軍民を重用してきた...

# masked language model
x_tok = enc.encode("武田信玄は、<|inputmask|>時代ファンならぜひ押さえ<|inputmask|>きたい名将の一人。")
model = model.cuda()
res = model.generator.predict_mlm(x_tok)
print(enc.decode(res[0]))
> 武田信玄は、戦国時代ファンならぜひ押さえておきたい名将の一人。

However, the tokenizer of GPT-NeoX-Japanese could not be used as-is. The program is the same, but the number of words is different: they have 32K tokens, I have 36K tokens. And here I run into the first difficulty: how can I implement my own tokenizer? I can copy the code from GPT-NeoX-Japanese, but I need to use a new vocabulary file (ja-swe36k.txt).

younesbelkada commented 1 year ago

Hi @tanreinama, thanks so much for sharing the progress! I am super excited that you have managed to move forward with the implementation and get consistent results with the original model.

Regarding this issue, I think you can first try to modify these lines here: https://github.com/huggingface/transformers/blob/8fb4d0e4b46282d96386c229b9fb18bf7c80c25a/src/transformers/models/gpt_neox_japanese/tokenization_gpt_neox_japanese.py#L50 and see if you can load your vocab file instead.

If that works, I advise you to use the Hugging Face Hub: https://huggingface.co/ and create a new model repo that will include your model weights, the vocab file, and the emoji JSON file (first create an account and create a repo). See for comparison the files here: https://huggingface.co/abeja/gpt-neox-japanese-2.7b/tree/main. You can directly push the weights using the push_to_hub method after loading the weights, e.g. model.push_to_hub("tanreinama/GPTsan") (see the sketch at the end of this comment).

After that, I think you will be more than ready to open a draft pull request on transformers! Don't forget to ping me and my colleague @ArthurZucker, as we worked together on integrating SwitchTransformers and we will be more than excited to guide you from there. Again, thank you!
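
A rough sketch of that loading-and-pushing step, reusing the module and file names from the earlier comment (the Hub repo name is just the example mentioned above):

import torch
from transformers.models.gptsan_japanese import modeling_gptsan_japanese, configuration_gptsan_japanese

# rebuild the model and load the converted PyTorch weights as before
conf = configuration_gptsan_japanese.GPTSANJapaneseConfig()
model = modeling_gptsan_japanese.GPTSANJapaneseModel(conf)
model.load_state_dict(torch.load("GPTSAN-2.8B-spout_is_uniform.pt"))

# push the weights and config to the Hub repository
model.push_to_hub("tanreinama/GPTsan")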

ArthurZucker commented 1 year ago

Nice to meet you! And thanks for your work. If the only difference between your tokenizer and GPT-NeoX-Japanese is the vocabulary file, you indeed only need to initialize your tokenizer with the custom vocab file 😉 You can do this by simply running the following snippet:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b", vocab_file = f"{path_to_my_file}")
>>> tokenizer.push_to_hub("tanreinaman/gptsan-japanese")

Now regarding the addition of the model: it should only be done if the architecture is different (if it is very, very similar, we should be using custom_models and not add a new modeling file). In that case we could also add, for example, just the modified part (if you have a new ExpertLayer that is fundamentally different).

Otherwise, you should just convert your checkpoint and use the current Hugging Face model.
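
For reference, a custom model hosted on the Hub with its own modeling code would then be loaded with trust_remote_code, roughly like this (the repo name is just the example from the earlier comments):

from transformers import AutoModel

# load the custom modeling code and weights directly from the Hub repository
model = AutoModel.from_pretrained("tanreinama/GPTsan", trust_remote_code=True)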

tanreinama commented 1 year ago

@younesbelkada @ArthurZucker Thank you so much! Thanks to that, I was able to create a repository on the Hugging Face Hub and upload the model. But it's still not perfect around the Tokenizer.

One issue is that the GPT-NeoX-Japanese Tokenizer does not handle "," properly. "," is an ASCII character, so it is interchangeable with <|byte44|>. So if I put code on the model side to absorb it and output <|byte44|>, it will work.

The other issue is more deeply rooted. My Tokenizer handles unencodable Unicode by encoding it into 256 kinds of <|byteXX|> tokens. Decoding these <|byteXX|> tokens back into multibyte UTF-8 doesn't work at all; in other words, it is not possible to convert <|byte227|><|byte130|><|byte161|> into the single character "ァ". GPT-NeoX-Japanese outputs bytearray([227]).decode("utf", errors="replace") + bytearray([130]).decode("utf", errors="replace") + bytearray([161]).decode("utf", errors="replace"), while the original encoder outputs bytearray([227,130,161]).decode("utf", errors="replace"), so that one is fine. These may be bugs in GPT-NeoX-Japanese, but their model may have been trained on that assumption, so it may not be a problem for them.
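
To illustrate the difference, here is a small standalone example (not code from either repository):

# UTF-8 bytes of "ァ", which the tokenizer emits as <|byte227|><|byte130|><|byte161|>
buf = [227, 130, 161]

# decoding each byte on its own (GPT-NeoX-Japanese behaviour) cannot recover the character
print("".join(bytearray([b]).decode("utf", errors="replace") for b in buf))  # -> '���'

# decoding the accumulated byte sequence at once (original encoder behaviour) works
print(bytearray(buf).decode("utf", errors="replace"))  # -> 'ァ'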

Here is what I am thinking: it will probably work to some extent as it currently stands, and I may be able to submit a pull request in this state. But to make it perfect, I need to modify the Tokenizer source code...

tanreinama commented 1 year ago

PR Created!

Thank you for the support. I was able to submit a PR to Hugging Face today. The slight differences in the Tokenizer could be absorbed by code on the model side. I'm excited to contribute to Hugging Face. Thank you!

ArthurZucker commented 1 year ago

Awesome! Will review asap 😉