huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add new CANINE model #11016

Closed · stefan-it closed this issue 3 years ago

stefan-it commented 3 years ago

🌟 New model addition

Model description

Google recently proposed CANINE, a new Character Architecture with No tokenization In Neural Encoders. Not only is the title exciting:

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
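To make the abstract concrete, here's a minimal PyTorch sketch of the core idea only (character code points in, strided downsampling, deep transformer stack on the shorter sequence). This is not the released CANINE code; every name and hyperparameter below is illustrative, and the plain embedding table stands in for CANINE's code-point hashing:

```python
import torch
import torch.nn as nn

class CharEncoderSketch(nn.Module):
    """Illustrative only: characters in -> downsample -> deep transformer stack."""

    def __init__(self, hidden=768, rate=4, layers=12, buckets=16384):
        super().__init__()
        # CANINE hashes Unicode code points into embedding buckets; a plain
        # embedding table over (codepoint mod buckets) stands in for that here.
        self.char_embed = nn.Embedding(buckets, hidden)
        # A strided convolution shrinks the character sequence 4x, so the
        # deep stack attends over a much shorter sequence.
        self.downsample = nn.Conv1d(hidden, hidden, kernel_size=rate, stride=rate)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.deep_stack = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, codepoints):                       # (batch, n_chars)
        x = self.char_embed(codepoints % self.char_embed.num_embeddings)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)
        return self.deep_stack(x)                        # (batch, n_chars // 4, hidden)

ids = torch.tensor([[ord(c) for c in "no tokenizer needed"]])  # raw code points
print(CharEncoderSketch()(ids).shape)                          # torch.Size([1, 4, 768])
```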

Overview of the architecture:

[figure: CANINE architecture overview]

Paper is available here.

We really need this architecture in Transformers (RIP subword tokenization)!

The first author (Jonathan Clark) said on Twitter that the model and code would be released in April :partying_face:

Open source status

stefan-it commented 3 years ago

Update on that: model and checkpoints are released:

https://github.com/google-research/language/tree/master/language/canine

:hugs:

iknoorjobs commented 3 years ago

Hi @stefan-it, thanks for the update. Do you know how we can use those pre-trained TensorFlow checkpoints to get pooled text representations from the CANINE model? Thanks!
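For context, I can list the checkpoint's variables with TensorFlow's generic checkpoint reader (the path below is just a placeholder for wherever the checkpoint was downloaded), but it's unclear how to run a forward pass to get pooled representations:

```python
import tensorflow as tf

# Hypothetical local path to the downloaded CANINE checkpoint; adjust to yours.
ckpt = "canine/canine_model.ckpt"

reader = tf.train.load_checkpoint(ckpt)
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
```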

krrishdholakia commented 3 years ago

Any updates on this?

NielsRogge commented 3 years ago

Hi,

I've started working on this. The forward pass in PyTorch is working and gives me the same output tensors as the TF implementation on the same input data.
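For reference, the check is just an element-wise comparison of the final hidden states from both implementations on identical inputs; a self-contained sketch with toy arrays (the function name is mine, not from the PR):

```python
import numpy as np

def check_parity(tf_hidden: np.ndarray, pt_hidden: np.ndarray, atol: float = 1e-4) -> bool:
    """Element-wise comparison of final hidden states from the two frameworks."""
    max_diff = float(np.abs(tf_hidden - pt_hidden).max())
    print(f"max absolute difference: {max_diff:.2e}")
    return bool(np.allclose(tf_hidden, pt_hidden, atol=atol))

# Toy stand-ins; in practice these come from running both implementations
# on the same input ids and extracting the final hidden states.
ref = np.random.default_rng(0).standard_normal((1, 8, 768)).astype(np.float32)
print(check_parity(ref, ref + 1e-6))  # True
```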

Will open a PR soon

stefan-it commented 3 years ago

Hi @dhgarrette,

I don't want to spam the CANINE PR with this question/discussion, so I'm asking it here in this issue 😅

So I would like to use CANINE for token classification (I'm currently implementing it into the Flair framework...), and for that reason pre-tokenized input is passed to the model. For token classification with e.g. BERT, one would use the first subword as the "pooling strategy". But when using CANINE and following that subword "analogy": is using the embedding of the first character of each token a good strategy (instead of e.g. the mean over all of its characters)? 🤔
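For concreteness, here's a sketch of the two pooling strategies I have in mind (all names are hypothetical; `char_hidden` stands for the model's character-level output, and `word_spans` for per-token character offsets):

```python
import torch

def pool_word_embeddings(char_hidden: torch.Tensor, word_spans, strategy: str = "first"):
    """Pool character-level hidden states into one vector per word.

    char_hidden: (seq_len, hidden) character representations from the model
    word_spans:  list of (start, end) character offsets, one per word
    strategy:    'first' mimics BERT's first-subword pooling;
                 'mean' averages all characters of the word.
    """
    pooled = []
    for start, end in word_spans:
        if strategy == "first":
            pooled.append(char_hidden[start])
        elif strategy == "mean":
            pooled.append(char_hidden[start:end].mean(dim=0))
        else:
            raise ValueError(strategy)
    return torch.stack(pooled)

# Toy usage for the text "Berlin is" (9 characters):
hidden = torch.randn(9, 768)
spans = [(0, 6), (7, 9)]                               # "Berlin", "is"
first = pool_word_embeddings(hidden, spans, "first")   # (2, 768)
mean = pool_word_embeddings(hidden, spans, "mean")     # (2, 768)
```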