Closed plonerma closed 10 months ago
Name | Link |
---|---|
Latest commit | d80d8444b57a848f6b581d67f137b71392eb4d7e |
Latest deploy log | https://app.netlify.com/sites/pyminicons/deploys/654a58112b57c600099f1365 |
Name | Link |
---|---|
Latest commit | a541824f922c7d1d45288e3aac842f9726ae3e86 |
Latest deploy log | https://app.netlify.com/sites/pyminicons/deploys/654df56773ef6400085bee1e |
Oops I should have probably merged this before I made a new change to scorer.py -- do you mind integrating my new changes and re-submitting? This is awesome btw, thank you so much!!
No problem. I will adapt it probably tomorrow. Thanks for developing the framework and accepting the change!
great -- sorry again for not first merging this!
Hey, I merged your master into my branch and additionally added the suffix option for conditional MLM scores (I hope that's in your interest).
Example usage:
from typing import List
from minicons import scorer
model = scorer.MaskedLMScorer('distilbert-base-cased', None)
prefixes = [
"The traveler lost",
"The traveler lost",
]
stimuli = [
"the souvenir",
"interest"
]
suffixes = [
"at the market.",
"at the market."
]
complete_sentences: List[str] = [f"{pre} {stim} {suff}" for pre, stim, suff in zip(prefixes, stimuli, suffixes)]
def reduction(t):
    return t.sum().item()

for PLL_metric in ("original", "within_word_l2r"):
    print("---", PLL_metric, "---")
    print("Individual tokens:")
    for sentence in model.token_score(complete_sentences, PLL_metric=PLL_metric):
        print(" ".join(f"{t} ({s})" for t, s in sentence))
    print("Complete sequence:", model.sequence_score(complete_sentences, PLL_metric=PLL_metric, reduction=reduction))
    print("Conditional:", model.conditional_score(prefix=prefixes, stimuli=stimuli, suffix=suffixes, PLL_metric=PLL_metric, reduction=reduction))
    print("\n")
Produces:
--- original ---
Individual tokens:
The (-2.931204319000244) travel (-3.1608409881591797) ##er (-4.340202808380127) lost (-10.719362258911133) the (-2.783437728881836) so (-0.018465042114257812) ##uve (-2.09808349609375e-05) ##nir (0.0) at (-2.0171499252319336) the (-1.7253851890563965) market (-5.643357276916504) . (-0.3891754150390625)
The (-3.215424060821533) travel (-4.790759563446045) ##er (-5.153533935546875) lost (-5.166162490844727) interest (-3.1110496520996094) at (-3.688335418701172) the (-1.3834552764892578) market (-6.61713171005249) . (-0.44433021545410156)
Complete sequence: [-33.728601932525635, -33.57018232345581]
Conditional: [-2.8019237518310547, -3.1110496520996094]
--- within_word_l2r ---
Individual tokens:
The (-2.931204319000244) travel (-8.166111946105957) ##er (-4.340202808380127) lost (-10.719362258911133) the (-2.783437728881836) so (-8.323075294494629) ##uve (-2.5555038452148438) ##nir (0.0) at (-2.0171499252319336) the (-1.7253851890563965) market (-5.643357276916504) . (-0.3891754150390625)
The (-3.215424060821533) travel (-9.80713939666748) ##er (-5.153533935546875) lost (-5.166162490844727) interest (-3.1110496520996094) at (-3.688335418701172) the (-1.3834552764892578) market (-6.61713171005249) . (-0.44433021545410156)
Complete sequence: [-49.593966007232666, -38.586562156677246]
Conditional: [-13.662016868591309, -3.1110496520996094]
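The two metrics differ exactly where the output shows it: under `within_word_l2r`, word-initial pieces like `travel` and `so` get lower scores because the remaining pieces of the same word are masked along with the target. A minimal, self-contained sketch of the masking rule (invented helper, not minicons' actual implementation; `word_ids` follows the Hugging Face fast-tokenizer convention of one word index per token):

```python
# Hypothetical sketch, NOT minicons' actual code: which positions get
# masked when scoring the token at `target`, for each PLL metric.
# word_ids[i] = index of the word that token i belongs to.

def masked_positions(word_ids, target, metric):
    if metric == "original":
        # Mask only the target token itself.
        return {target}
    if metric == "within_word_l2r":
        # Also mask the tokens to the RIGHT of the target that belong to
        # the same word, i.e. score multi-token words left-to-right.
        return {i for i, w in enumerate(word_ids)
                if w == word_ids[target] and i >= target}
    raise ValueError(f"unknown PLL metric: {metric}")

# "The travel ##er lost" -> word ids: The=0, travel=1, ##er=1, lost=2
word_ids = [0, 1, 1, 2]
print(masked_positions(word_ids, 1, "original"))         # just "travel"
print(masked_positions(word_ids, 1, "within_word_l2r"))  # "travel" and "##er"
```

With `##er` hidden as well, predicting `travel` is harder, which is why its log-probability drops from about -3.16 to -8.17 between the two runs above.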
Currently it is not possible to use the `within_word_l2r` strategy in the `conditional_score` of `MaskedLMScorer`. This PR fixes this by using a masking function which is shared between `prepare_text` and `prime_text` (and additionally sets up the possibility of using non-masked suffixes in the MLM scorer). This allows the usage and output shown in the example above.
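The key idea behind conditional scoring with a suffix can be illustrated in a few lines of plain Python (the names below are invented for illustration and are not the PR's actual API): only the stimulus tokens are masked one at a time, while the prefix and suffix tokens stay visible as context.

```python
# Hypothetical illustration, not the PR's implementation: build the masked
# input variants used for conditional MLM scoring with an unmasked suffix.

MASK = "[MASK]"

def masked_variants(prefix_toks, stimulus_toks, suffix_toks):
    """Yield (masked_sequence, target_position), one per stimulus token."""
    tokens = prefix_toks + stimulus_toks + suffix_toks
    start = len(prefix_toks)
    for offset in range(len(stimulus_toks)):
        pos = start + offset
        variant = tokens.copy()
        variant[pos] = MASK  # only the current stimulus token is hidden;
        yield variant, pos   # prefix and suffix remain visible context

prefix = ["The", "travel", "##er", "lost"]
stimulus = ["the", "so", "##uve", "##nir"]
suffix = ["at", "the", "market", "."]
for variant, pos in masked_variants(prefix, stimulus, suffix):
    print(pos, " ".join(variant))
```

Each variant is scored at its masked position and the per-token log-probabilities are combined by the chosen `reduction`; under `within_word_l2r` the remaining within-word pieces of the stimulus would be masked as well, as in the metric's definition.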