oborchers / Fast_Sentence_Embeddings

Compute Sentence Embeddings Fast!
GNU General Public License v3.0

Does FSE guarantee ordering of vectors to be that of the input sentences? #36

Closed grantmwilliams closed 3 years ago

grantmwilliams commented 3 years ago

For an example like:

import pandas as pd

from fse.models import uSIF
from fse import SplitIndexedList
from gensim.models.keyedvectors import FastTextKeyedVectors

fasttext_model_path = "models/fasttext-wiki-news-subwords-300.model"
ft = FastTextKeyedVectors.load(fasttext_model_path)

sent_fp = "data/sentences/sentences.csv.gz"
df = pd.read_csv(sent_fp)

sentences = df.sentence.values

indexed_sentences = SplitIndexedList(sentences)

model = uSIF(ft, workers=2, lang_freq="en")

sentence_count, word_count = model.train(indexed_sentences)

embeddings = model.sv.vectors

Where I read in an ordered list of sentences and then process them through a pre-trained model, does FSE guarantee the order of the model vectors to be the same order that the sentences were fed in?

I didn't see anything in the documentation or source code to suggest they wouldn't be, but I also haven't seen in the documentation any claims for guaranteed ordering either.

Thanks!

oborchers commented 3 years ago

Hi @grantmwilliams,

The destination to which a sentence vector is written depends entirely on the input. Each input is an iterable of (list[str], int) tuples, where the int is the target index. All input wrappers of type Indexed use the supplied order of the inputs, so with SplitIndexedList the vector at row i belongs to the i-th sentence you passed in. The CIndexed wrappers can instead be used to supply a custom set of indices, for example for many-to-one mappings.
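To make the index-based placement concrete, here is a minimal pure-Python sketch. It does not use fse and is not its implementation: `TOY_VECTORS`, `indexed`, and `average_into_slots` are hypothetical names invented for illustration, and plain averaging stands in for the actual sentence-embedding model. It only shows the idea that each (list[str], int) pair decides which output row a sentence ends up in.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Toy 2-d word vectors; hypothetical stand-ins for a pre-trained model.
TOY_VECTORS: Dict[str, List[float]] = {
    "hello": [1.0, 0.0],
    "world": [0.0, 1.0],
    "fast": [2.0, 2.0],
}


def indexed(sentences: List[str]) -> Iterator[Tuple[List[str], int]]:
    """Yield (tokens, index) pairs in input order (assumed Indexed-style behaviour)."""
    for idx, sentence in enumerate(sentences):
        yield sentence.split(), idx


def average_into_slots(
    pairs: Iterable[Tuple[List[str], int]], size: int, dim: int = 2
) -> List[List[float]]:
    """Average the word vectors of each sentence into the row named by its index."""
    out = [[0.0] * dim for _ in range(size)]
    counts = [0] * size
    for tokens, idx in pairs:
        for tok in tokens:
            vec = TOY_VECTORS.get(tok)
            if vec is None:
                continue  # skip out-of-vocabulary tokens
            for d in range(dim):
                out[idx][d] += vec[d]
            counts[idx] += 1
    for idx in range(size):
        if counts[idx]:
            out[idx] = [v / counts[idx] for v in out[idx]]
    return out


sentences = ["hello world", "fast", "world"]

# Indexed-style input: row i corresponds to sentences[i].
ordered = average_into_slots(indexed(sentences), size=len(sentences))

# Custom indices (many-to-one): the first two sentences share row 0.
custom_pairs = [(s.split(), i) for s, i in zip(sentences, [0, 0, 1])]
merged = average_into_slots(custom_pairs, size=2)
```

In this toy version, sentences that share an index are pooled by averaging all of their words into one row. That pooling rule is a choice made for the sketch, not a statement about how fse combines them.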

I think this library needs some updates by now, especially the documentation.