kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0

Do you think bigscience/bloom can be a replacement of facebook/opt model ? #7

Closed svjack closed 1 year ago

svjack commented 1 year ago

If I want to replace the LM in this project, would you recommend bigscience/bloom as a multilingual replacement, or do you have other recommendations? I want the replacement model to work on question-answering downstream tasks. I'm also curious why the model works on question-answering downstream tasks even though the training loss is not related to QA. Does this rely only on the few-shot ability of facebook/opt?

I also see that "bert" appears as an option in an if/else branch in models.py, which suggests you considered "bert" as a replacement. Can you share a FrozenArgs configuration for a "bert" model?

kohjingyu commented 1 year ago

In principle, there is nothing special about OPT. Using BLOOM should also work, as long as you update the model API calls (if they are different, which they might not be).

The BERT models were something used early in development. We didn't train any BERT-like models in the final version, so I don't have any config files for them, sorry.
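For reference, a minimal sketch of what the swap might look like, assuming BLOOM is loaded through the standard Hugging Face AutoModelForCausalLM / AutoTokenizer interface (the variable names below are illustrative, not the actual fromage/FrozenArgs fields):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical swap: point the frozen LM at a BLOOM checkpoint instead of OPT.
lm_name = "bigscience/bloom-560m"   # in place of "facebook/opt-6.7b"

tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)

# fromage keeps the LM frozen and trains only the linear mappings and the [RET] embedding,
# so a replacement LM would be frozen the same way.
for param in lm.parameters():
    param.requires_grad = False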

svjack commented 1 year ago

Why does the BLOOM tokenizer place the padding tokens at the head of the sequence when padding to max_length?

from transformers import AutoTokenizer

native_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m",
                                                 use_fast=False)
caption = "a bear in the woods."
tokenized_data = native_tokenizer(
    caption,
    return_tensors="pt",
    padding='max_length',
    truncation=True,
    max_length=56)
tokens = tokenized_data.input_ids[0]
tokens

will produce

tensor([     3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,      3,      3,      3,      3,
             3,      3,      3,      3,      3,     68,  50507,    361,    368,
        165526,     17])

It pads with pad_token_id 3 at the head, not the tail. This is different from other models. Why does this happen?
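A minimal check and workaround, assuming this is controlled by the tokenizer's padding_side attribute (an assumption on my part, not confirmed from the fromage code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m", use_fast=False)
print(tokenizer.padding_side)  # inspect the default padding side shipped with the tokenizer

# If left-padding is the cause, right padding can be forced explicitly:
tokenizer.padding_side = "right"
tokens = tokenizer("a bear in the woods.",
                   return_tensors="pt",
                   padding="max_length",
                   truncation=True,
                   max_length=56).input_ids[0]
print(tokens)  # pad tokens should now appear at the tail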

kohjingyu commented 1 year ago

I've never used the BLOOM models before, so I don't know what this issue is, sorry. I think this is something you will have to check with the authors of that model.

svjack commented 1 year ago

After training, the code below initializes the [RET] embedding:

  with torch.no_grad():
      # Copy the trained [RET] embedding from the checkpoint into the frozen LM's input embeddings.
      model.model.input_embeddings.weight[model.model.retrieval_token_idx, :].copy_(
          checkpoint['state_dict']['ret_input_embeddings.weight'].cpu().detach())

Which naming rule is used to derive ret_input_embeddings from the network in the source code?

kohjingyu commented 1 year ago

You can produce ret_input_embeddings by extracting the trained [RET] token embedding as follows:

state_dict['ret_input_embeddings.weight'] = state_dict['model.input_embeddings.weight'][args.retrieval_token_idx].clone()

The benefit of doing this is that we save space, since we don't need to retain the frozen OPT embeddings; we only need to save the [RET] one.
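Putting the two directions together, a minimal sketch of the save/load roundtrip, assuming the variable names from the snippets in this thread (state_dict, args.retrieval_token_idx, and the model attributes are taken from the examples above, not verified against the full source):

import torch

# Saving: keep only the trained [RET] row instead of the full (frozen) embedding matrix.
pruned_state_dict = {}
pruned_state_dict['ret_input_embeddings.weight'] = \
    state_dict['model.input_embeddings.weight'][args.retrieval_token_idx].clone()

# Loading: copy that single row back into the frozen LM's embedding table.
with torch.no_grad():
    model.model.input_embeddings.weight[model.model.retrieval_token_idx, :].copy_(
        pruned_state_dict['ret_input_embeddings.weight'])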