Supporting more BERT-like models

ashvardanian commented 2 months ago

Hi HF team!

I am extending our UForm repository of multimodal models to support Swift and mobile deployments, and along the way I've noticed that several classes needed by a broad range of BERT-like models are not yet supported by swift-transformers. So I've added a WordPieceDecoder class and aliases for BertPreTokenizer and BertProcessing.
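
For readers unfamiliar with WordPiece, here is a minimal sketch of what decoding involves, assuming the usual BERT convention that continuation pieces carry a `##` prefix. The function name `decodeWordPiece` and its parameters are hypothetical and only illustrate the idea; this is not the code from the PR.

```swift
// Hypothetical helper, not part of swift-transformers.
// Joins WordPiece tokens back into text: a token starting with the
// continuation prefix is glued to the previous token, any other token
// starts a new space-separated word.
func decodeWordPiece(_ tokens: [String], prefix: String = "##") -> String {
    var text = ""
    for (index, token) in tokens.enumerated() {
        if token.hasPrefix(prefix) {
            // Continuation piece: drop the prefix and append without a space.
            text.append(contentsOf: token.dropFirst(prefix.count))
        } else {
            // New word: separate from the previous word with a single space.
            if index > 0 { text.append(" ") }
            text.append(token)
        }
    }
    return text
}

let decoded = decodeWordPiece(["un", "##believ", "##able", "results"])
print(decoded)
```

A real decoder would probably also clean up spacing around punctuation; that step is left out here.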

Moreover, as you are well aware, config.json and tokenizer.json come in all shapes and sizes. So I've added fallback mechanisms to handle the different tuple orders found in vocabulary listings.
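
To illustrate the kind of fallback I mean, here is a rough sketch that assumes a vocabulary entry is a two-element array holding a token and a score in either order. The helper name `parseVocabEntry` is hypothetical and the PR's actual parsing code may look different.

```swift
// Hypothetical helper, not the PR's implementation.
// Accepts a vocabulary entry as either [token, score] or [score, token]
// and returns it in a fixed (token, score) order, or nil if neither fits.
func parseVocabEntry(_ entry: [Any]) -> (token: String, score: Double)? {
    guard entry.count == 2 else { return nil }
    // Preferred layout: token first, score second.
    if let token = entry[0] as? String, let score = entry[1] as? Double {
        return (token, score)
    }
    // Fallback layout: score first, token second.
    if let score = entry[0] as? Double, let token = entry[1] as? String {
        return (token, score)
    }
    return nil
}

let tokenFirst = parseVocabEntry(["hello", -1.5])
let scoreFirst = parseVocabEntry([-1.5, "hello"])
let malformed = parseVocabEntry(["hello", "world"])
```

Both orderings normalize to the same pair, and a malformed entry yields `nil` so the caller can decide whether to skip it or fail.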

The current main-dev branch of UForm is already using this functionality from my fork. I am looking into integrating more Hub functionality next. Please let me know what you think about this PR 🤗

ashvardanian commented 2 months ago

Another open problem that I've recently discovered is the way strings are compared in Swift. By default, the language compares strings by Unicode canonical equivalence, so two strings with different underlying code points can still compare as equal. This is great for some applications, but horrible for tokenization, especially with multilingual models. I've solved that by introducing a LiteralString wrapper for String that uses literal comparison:

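    import Foundation  // needed for compare(_:options:)

    struct LiteralString: Hashable {
        let value: String

        // Literal comparison: no Unicode-equivalence folding.
        static func ==(lhs: LiteralString, rhs: LiteralString) -> Bool {
            return lhs.value.compare(rhs.value, options: .literal) == .orderedSame
        }

        // Literally equal strings are also equal as String, so String's hash stays consistent with ==.
        func hash(into hasher: inout Hasher) {
            hasher.combine(value)
        }
    }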

I believe it should be applicable in other places as well. Let me know what you think, @pcuenca 🤗
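
To make the problem concrete, here is a small self-contained sketch. It uses a hypothetical Foundation-free variant named `ByteLiteralString`, which compares and hashes raw UTF-8 bytes instead of calling `compare(_:options:)`. That is my assumption for illustration, not the wrapper from the PR. The two spellings of "é" are equal as plain `String` keys but stay distinct once wrapped.

```swift
// Hypothetical variant of the wrapper: equality and hashing are both
// defined over the raw UTF-8 bytes, so no Unicode normalization applies.
struct ByteLiteralString: Hashable {
    let value: String

    static func == (lhs: ByteLiteralString, rhs: ByteLiteralString) -> Bool {
        return lhs.value.utf8.elementsEqual(rhs.value.utf8)
    }

    func hash(into hasher: inout Hasher) {
        for byte in value.utf8 { hasher.combine(byte) }
    }
}

let precomposed = "\u{E9}"   // "é" as a single scalar
let decomposed = "e\u{301}"  // "e" followed by a combining acute accent

// Default String equality treats the two spellings as the same string.
let defaultEqual = (precomposed == decomposed)
// The byte-wise wrapper keeps them apart.
let literalEqual = (ByteLiteralString(value: precomposed) == ByteLiteralString(value: decomposed))

// With plain String keys the second insert overwrites the first entry.
var plainVocab: [String: Int] = [:]
plainVocab[precomposed] = 0
plainVocab[decomposed] = 1

// With wrapped keys both tokens keep their own entry.
var literalVocab: [ByteLiteralString: Int] = [:]
literalVocab[ByteLiteralString(value: precomposed)] = 0
literalVocab[ByteLiteralString(value: decomposed)] = 1

print(defaultEqual, literalEqual, plainVocab.count, literalVocab.count)
```

In a vocabulary lookup, the plain `String` dictionary silently merges two distinct tokens into one entry, which is the failure mode described above.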

ConfuseIous commented 1 month ago

Any chance of this being merged in soon? I'm trying to use a BERT model and this PR would be a huge help :)

huggingface / swift-transformers

Supporting more BERT-like models #89