apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html
Hi there, I'm trying to use apricot to help find a diverse set of texts. When I use the `fit` method, everything works intuitively. However, when I start using the `partial_fit` method, the outputs do not appear to be correct. I suspect that I'm misunderstanding something about how the library works. In case I'm not, I've prepared a small demo of the issue with explanations of what I got vs. what I expected.
```python
from textdiversity import POSSequenceDiversity
from apricot import FacilityLocationSelection

def chunker(seq, size):
    # Yield consecutive batches of `size` elements from `seq`.
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def test_apricot(featurizer, texts, fit_type="full_fit", batch_size=2):
    selector = FacilityLocationSelection(
        n_samples=len(texts),
        metric='euclidean',
        optimizer='lazy')

    if fit_type == "full_fit":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.fit(Z)
    elif fit_type == "unbatched_partial":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.partial_fit(Z)
    elif fit_type == "batched_partial":
        for batch in chunker(texts, batch_size):
            f, c = featurizer.extract_features(batch)
            Z = featurizer.calculate_similarities(f)
            selector.partial_fit(Z)

    print(f"{fit_type} ranking: {selector.ranking} | gain: {sum(selector.gains)}")

# test ====================================================
d = POSSequenceDiversity()

texts = ["This is a test.",
         "This is also a test.",
         "This is the real deal.",
         "So is this one."]

test_apricot(d, texts, "full_fit")          # > ranking: [0 3 1 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial")   # > ranking: [2 3] | gain: 0.4444444444444444

texts = ["This is the real deal.",
         "So is this one.",
         "This is a test.",
         "This is also a test."]

test_apricot(d, texts, "full_fit")          # > ranking: [0 1 3 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial")   # > ranking: [0 1] | gain: 0.5
```
Full fit: makes intuitive sense. Texts with overlapping semantics get relegated to lower rankings, etc.
Unbatched partial: I would have expected the unbatched partial fit to behave the same as the full fit, but no matter what order I put the texts in (reversed, or any other permutation), I always get [0 1 2 3]. Since `partial_fit` returns the same ranking regardless of the input order, this may indicate a bug, or I may not understand it well enough. Please let me know.
Batched partial: This one is responsive to changes in the order of the texts, but a) it does not respect the `n_samples` parameter (I wanted to rank all the texts), and b) it does not agree with the ranking from the full fit (which I trust the most, but unfortunately cannot use due to the size of my dataset).
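In case it helps with reproduction, here is a stripped-down sketch of the same `fit` vs. `partial_fit` comparison without the textdiversity dependency. The random feature matrix and the variable names (`X`, `full`, `partial`) are my own illustrative assumptions; only the selector calls mirror the demo above:

```python
import numpy as np
from apricot import FacilityLocationSelection

# Illustrative stand-in for the featurized texts (assumption: 4 items, 8 features).
np.random.seed(0)
X = np.random.rand(4, 8)

# One pass with fit().
full = FacilityLocationSelection(n_samples=4, metric='euclidean', optimizer='lazy')
full.fit(X)
print("fit:        ", full.ranking, sum(full.gains))

# One pass with partial_fit() on the exact same matrix.
partial = FacilityLocationSelection(n_samples=4, metric='euclidean', optimizer='lazy')
partial.partial_fit(X)
print("partial_fit:", partial.ranking, sum(partial.gains))
```

If the two rankings disagree here as well, that would at least rule out textdiversity as the culprit.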
Thanks for taking the time to read this, and for potentially helping me out.