ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Impossibility to use a tokenizer with auto_transformer #3894

Open sergsb opened 9 months ago

sergsb commented 9 months ago

I want to use ibm/MoLFormer-XL-both-10pct as an encoder. As you can see from the model's description, it can be loaded like:

model = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct", deterministic_eval=True, trust_remote_code=True) 
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

I try to load it using

encoder:
  type: auto_transformer
  pretrained_model_name_or_path: ibm/MoLFormer-XL-both-10pct

It results in RuntimeError: Caught exception during model preprocessing: Tokenizer class MolformerTokenizer does not exist or is not currently imported. This is not surprising, because there is no MolformerTokenizer class to import; the model is meant to be loaded through AutoTokenizer instead.
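For comparison, the tokenizer loads fine outside Ludwig as long as trust_remote_code=True is passed. A minimal reproduction in plain transformers (no Ludwig involved; the exact error text may vary with the transformers version) looks like this:

from transformers import AutoTokenizer

# Without trust_remote_code, transformers cannot resolve the custom MolformerTokenizer
# class referenced by the repo's config and fails with an error like the one above:
#   tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct")

# With trust_remote_code=True, the custom tokenizer code is fetched from the model repo:
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)
print(tokenizer("CCO"))  # example input; MoLFormer operates on SMILES strings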

However, the documentation says: "If a text feature's encoder specifies a huggingface model, then the tokenizer for that model will be used automatically."

How can I load the tokenizer for this model?

sergsb commented 9 months ago

I found out that the problem is with trust_remote_code, which is also required for loading the tokenizer. See also https://github.com/ludwig-ai/ludwig/pull/3632

justinxzhao commented 9 months ago

Hi @sergsb,

Thanks for sharing your experience.

The Ludwig team is focused on building first-class support for models that are natively supported on HF. As I understand it, supporting models that require trust_remote_code=True is tenable, but it carries other risks that need to be thought through.

CC: @arnavgarg1

sergsb commented 9 months ago

Hi @justinxzhao,

Thanks for the answer. Maybe an option would be to introduce a global config parameter, trust_remote_code, and pass it through to the HF model and tokenizer loading calls?
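To make the idea concrete, here is a rough sketch of what that could look like; the trust_remote_code config option and the load_hf_encoder helper below are hypothetical, not an existing Ludwig API:

from transformers import AutoModel, AutoTokenizer

def load_hf_encoder(pretrained_model_name_or_path: str, trust_remote_code: bool = False):
    # A global (or per-encoder) trust_remote_code flag from the Ludwig config would simply
    # be forwarded to both from_pretrained calls.
    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path, trust_remote_code=trust_remote_code
    )
    model = AutoModel.from_pretrained(
        pretrained_model_name_or_path, trust_remote_code=trust_remote_code
    )
    return model, tokenizer

# e.g., for a config that sets a (hypothetical) trust_remote_code: true on the encoder:
model, tokenizer = load_hf_encoder("ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)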

justinxzhao commented 9 months ago

@sergsb that seems reasonable to me. I think that's what @arnavgarg1 was going for in https://github.com/ludwig-ai/ludwig/pull/3632, specifically here.