huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Cross Encoder #497

Closed by achrafash 9 months ago

achrafash commented 9 months ago

Question

I'm trying to run this pre-trained Cross Encoder model (MS Marco TinyBERT), which is not yet available for Transformers.js.

I've managed to convert it using the handy conversion script, and I can run it successfully with the "feature-extraction" task:

import { pipeline } from '@xenova/transformers';

const pairs = [
    ["How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."],
    ["How many people live in Berlin?", "Berlin is well known for its museums."],
];

// `modelName` refers to the converted model
const model = await pipeline("feature-extraction", modelName);
const out = await model(pairs[0]);

console.log(Array.from(out.data)); // [-8.387903213500977, -9.811422348022461]

But I'd like to run it as a Cross Encoder, as it's intended to be used, like in the Python example code:

from sentence_transformers import CrossEncoder

model_name = 'cross-encoder/ms-marco-TinyBERT-L-2-v2'
model = CrossEncoder(model_name, max_length=512)

scores = model.predict([
    ('How many people live in Berlin?', 'Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'),
    ('How many people live in Berlin?', 'Berlin is well known for its museums.')
])

print(scores)  # [ 7.1523685 -6.2870455]

How can I infer a similarity score from two sentences?

PS: if there are existing models or techniques for sentence similarity, I'd happily use them!

xenova commented 9 months ago

Hi there 👋 Support for this model will be added in #501 (and released in v2.13.4) :)

xenova commented 7 months ago

Just for future readers, here's a code sample which shows how to achieve the same behaviour as the new CrossEncoder.rank function in sentence-transformers:

import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';

const model_id = 'mixedbread-ai/mxbai-rerank-xsmall-v1';
const model = await AutoModelForSequenceClassification.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);

/**
 * Performs ranking with the CrossEncoder on the given query and documents. Returns a sorted list with the document indices and scores.
 * @param {string} query A single query
 * @param {string[]} documents A list of documents
 * @param {Object} options Options for ranking
 * @param {number} [options.top_k=undefined] Return the top-k documents. If undefined, all documents are returned.
 * @param {boolean} [options.return_documents=false] If true, also returns the documents. If false, only returns the indices and scores.
 */
async function rank(query, documents, {
    top_k = undefined,
    return_documents = false,
} = {}) {
    const inputs = tokenizer(
        new Array(documents.length).fill(query),
        {
            text_pair: documents,
            padding: true,
            truncation: true,
        }
    )
    const { logits } = await model(inputs);
    return logits
        .sigmoid()
        .tolist()
        .map(([score], i) => ({
            corpus_id: i,
            score,
            ...(return_documents ? { text: documents[i] } : {})
        }))
        .sort((a, b) => b.score - a.score)
        .slice(0, top_k);
}

// Example usage:
const query = "Who wrote 'To Kill a Mockingbird'?"
const documents = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
    "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
    "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
    "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
    "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
    "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]

const results = await rank(query, documents, { return_documents: true, top_k: 3 });
console.log(results);
// [
//   {
//     corpus_id: 0,
//     score: 0.9930814504623413,
//     text: "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature."
//   },
//   {
//     corpus_id: 2,
//     score: 0.9839422106742859,
//     text: "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961."
//   },
//   {
//     corpus_id: 3,
//     score: 0.437057226896286,
//     text: 'Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.'
//   }
// ]
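
The post-processing half of `rank` (sigmoid, sort, top-k) can also be tried in isolation, without downloading a model. What follows is only a minimal sketch under a few assumptions: `rankFromLogits` and `sigmoid` are hypothetical helper names that are not part of Transformers.js, the logits are assumed to already be a plain nested array with one `[logit]` row per document, and the numbers are made up for illustration.

```javascript
// Hypothetical helpers (not part of Transformers.js).
// Maps a raw logit to a score in (0, 1).
function sigmoid(x) {
    return 1 / (1 + Math.exp(-x));
}

// `logits` is assumed to be a plain nested array, one [logit] row per document.
function rankFromLogits(logits, documents, {
    top_k = undefined,
    return_documents = false,
} = {}) {
    return logits
        .map(([logit], i) => ({
            corpus_id: i,
            score: sigmoid(logit),
            ...(return_documents ? { text: documents[i] } : {}),
        }))
        .sort((a, b) => b.score - a.score) // highest score first
        .slice(0, top_k);                  // top_k = undefined keeps everything
}

// Made-up logits for three placeholder documents, purely for illustration.
const exampleLogits = [[-1.5], [4.2], [0.3]];
const exampleDocs = ["doc A", "doc B", "doc C"];

const ranked = rankFromLogits(exampleLogits, exampleDocs, { top_k: 2, return_documents: true });
console.log(ranked.map((r) => r.corpus_id)); // [ 1, 2 ]
```

Because the sigmoid is monotonic, sorting by the raw logits would give the same order; it only rescales the scores into the 0 to 1 range. That is also why the raw Python scores earlier in the thread (for example 7.15 and -6.29) are on a different scale from the scores printed above.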