huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

text.split is not a function #525

Closed · Maxzurek closed this issue 8 months ago

Maxzurek commented 9 months ago

System Info

- Transformers.js version: 2.14.0
- Framework: React 18.2.0
- Browser: Chrome 120.0.6099.218
- Node.js version: 20.2.0

Description

I am attempting to use the crossencoder-distilcamembert-mmarcoFR model as a re-ranker in React, following the provided model card and inference code. The original Python code using the transformers library works as expected, but when I translate it to React using Transformers.js, I encounter a TypeError (text.split is not a function) in the PreTrainedTokenizer call.

Model Information

Python code from the model card

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR')
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-distilcamembert-mmarcoFR')

pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2') , ('Query', 'Paragraph3')]
features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    scores = model(**features).logits
print(scores)

Reproduction

  1. Use the provided React code to load the model and tokenizer.
  2. Attempt to tokenize a nested array as shown in the code snippet.
  3. Observe the encountered error.

Code Snippet

import { useEffect } from "react";
import { AutoModelForSequenceClassification, AutoTokenizer } from "@xenova/transformers";

useEffect(() => {
    const classifySequence = async () => {
        const modelName = "Oblix/crossencoder-camembert-base-mmarcoFR_ONNX";
        const model = await AutoModelForSequenceClassification.from_pretrained(modelName);
        const tokenizer = await AutoTokenizer.from_pretrained(modelName);
        const pairs = [
            ["Quelle est la capitale de la France?", "La capitale de la France est Paris"]
        ];
        try {
            // Throws: TypeError: text.split is not a function
            const input = await tokenizer(pairs, {
                padding: true,
                truncation: true
            });
            const output = await model(input);
            console.log(output);
        } catch (error) {
            console.error("Error:", error);
        }
    };

    classifySequence();
}, []);

Error (screenshot): TypeError: text.split is not a function

xenova commented 9 months ago

Hi there 👋 Due to the differences in how JavaScript and Python handle optional positional and keyword arguments, we modified the API slightly to account for this. See here for example usage:

import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';

const model = await AutoModelForSequenceClassification.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');

const features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    {
        text_pair: [
            'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
            'New York City is famous for the Metropolitan Museum of Art.',
        ],
        padding: true,
        truncation: true,
    }
)

const scores = await model(features)
console.log(scores);
// quantized:   [ 7.210887908935547, -11.559350967407227 ]
// unquantized: [ 7.235750675201416, -11.562294006347656 ]
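
The same pattern extends to re-ranking a list of passages: score each (query, passage) pair and sort by score. Below is a minimal sketch, assuming the same @xenova/transformers API as above (the rerank helper and the hard-coded model name are illustrative, not part of the library):

import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';

// Illustrative helper: scores each passage against the query with a
// cross-encoder and returns the passages sorted from most to least relevant.
async function rerank(query, passages) {
    const model = await AutoModelForSequenceClassification.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');
    const tokenizer = await AutoTokenizer.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');

    // Repeat the query once per passage so each (query, passage) pair is scored.
    const features = tokenizer(
        Array(passages.length).fill(query),
        { text_pair: passages, padding: true, truncation: true },
    );

    const { logits } = await model(features);
    // logits has shape [passages.length, 1]; flatten it to plain numbers.
    const scores = Array.from(logits.data);

    return passages
        .map((text, i) => ({ text, score: scores[i] }))
        .sort((a, b) => b.score - a.score);
}
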
Maxzurek commented 9 months ago

This is what I was looking for, although it doesn't seem to work for this specific model, crossencoder-distilcamembert-mmarcoFR_ONNX (it does work for ms-marco-TinyBERT-L-2-v2 and all the MiniLM variants).

I will be using TinyBERT since it seems to be performing really well for my use case!

Just a quick question: is the example you gave me (or a similar one) available in the documentation? If not, I think it would be a great addition, since almost every cross-encoder needs to tokenize text pairs.

Thank you for this great library by the way :heart:

sneko commented 6 months ago

@Maxzurek I guess I had the same issue as you when using https://huggingface.co/Oblix/crossencoder-distilcamembert-mmarcoFR_ONNX (the logits property comes back undefined).

But I'm curious if you know why: what could explain why this model cannot be used? Is it something related to how the initial model https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR was built? Or does Transformers.js not yet handle all model cases?

By the way @xenova , thanks for porting ML stuff to the JS/TS environment, it's very valuable!

Maxzurek commented 6 months ago

@sneko It might be because the model is not yet supported by Transformers.js, but @xenova can probably answer better. I've had good results so far with Xenova/ms-marco-TinyBERT-L-2-v2, although I don't think the model is multilingual (I use it mostly for English and French).