Open ashvardanian opened 2 months ago
Another open problem that I've recently discovered is the way strings are compared in Swift. By default, the language uses Unicode-aware normalization (canonical equivalence) when comparing strings. This is great for some applications, but horrible for tokenization, especially with multilingual models. I've solved that by introducing a LiteralString wrapper for String that uses literal comparison:
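import Foundation  // needed for compare(_:options:)

struct LiteralString: Hashable {
    let value: String

    // Compare without Unicode normalization, so canonically
    // equivalent but differently encoded strings stay distinct.
    static func ==(lhs: LiteralString, rhs: LiteralString) -> Bool {
        return lhs.value.compare(rhs.value, options: .literal) == .orderedSame
    }

    // Literally equal strings always hash the same, so this is consistent with ==.
    func hash(into hasher: inout Hasher) {
        hasher.combine(value)
    }
}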
I believe it should be applicable in other places as well. Let me know what you think, @pcuenca 🤗
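
To make the difference concrete, here is a small sketch of how the two comparison modes behave for a vocabulary lookup. The strings and dictionary names are made up for illustration, and the struct is repeated only so the snippet is self-contained:

```swift
import Foundation

// Same wrapper as above, repeated so this snippet runs on its own.
struct LiteralString: Hashable {
    let value: String
    static func ==(lhs: LiteralString, rhs: LiteralString) -> Bool {
        return lhs.value.compare(rhs.value, options: .literal) == .orderedSame
    }
    func hash(into hasher: inout Hasher) {
        hasher.combine(value)
    }
}

// Two encodings of the same visible character "é" (illustrative tokens).
let precomposed = "\u{00E9}"   // single scalar U+00E9
let decomposed = "e\u{0301}"   // "e" followed by a combining acute accent

// Plain String keys: the two tokens are canonically equivalent,
// so the second assignment overwrites the first.
var plainVocab: [String: Int] = [:]
plainVocab[precomposed] = 0
plainVocab[decomposed] = 1

// LiteralString keys: the two tokens remain separate entries.
var literalVocab: [LiteralString: Int] = [:]
literalVocab[LiteralString(value: precomposed)] = 0
literalVocab[LiteralString(value: decomposed)] = 1

print(plainVocab.count, literalVocab.count)
```

With plain String keys, a tokenizer vocabulary that contains both forms silently loses one of them, which is the failure mode the wrapper is meant to avoid.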
Any chance of this being merged in soon? I'm trying to use a BERT model and this PR would be a huge help :)
Hi HF team!
I am extending our UForm repository of multimodal models to support Swift and mobile deployments, and along the way I've noticed that several classes for a broad range of BERT-like models are not yet supported by swift-transformers. So I've added a WordPieceDecoder class and aliases for BertPreTokenizer and BertProcessing.

Moreover, as you are well aware, config.json and tokenizer.json come in all shapes and sizes. So I've added fallback mechanisms to handle different tuple order in vocabulary listings.

The current main-dev branch of UForm is already using this functionality from my fork. I am looking into integrating more Hub functionality next. Please let me know what you think about this PR 🤗
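
For context, the idea behind a decoder like the WordPieceDecoder mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name decodeWordPiece and its continuationPrefix parameter are hypothetical, and the "##" prefix is assumed to follow the usual BERT vocabulary convention.

```swift
// Hypothetical helper, not part of swift-transformers.
// Joins WordPiece tokens back into text: tokens that start with the
// continuation prefix are glued to the previous token, all other
// tokens start a new space-separated word.
func decodeWordPiece(_ tokens: [String], continuationPrefix: String = "##") -> String {
    var text = ""
    for token in tokens {
        if token.hasPrefix(continuationPrefix) {
            // Continuation piece: strip the prefix and append directly.
            text.append(contentsOf: token.dropFirst(continuationPrefix.count))
        } else {
            // New word: separate from the previous one with a space.
            if !text.isEmpty {
                text.append(" ")
            }
            text.append(token)
        }
    }
    return text
}

let decoded = decodeWordPiece(["un", "##aff", "##able", "word"])
print(decoded)
```

A real decoder would typically also clean up spacing around punctuation and handle special tokens, which this sketch leaves out.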