kanishkamisra / minicons

Utility for behavioral and representational analyses of Language Models
https://minicons.kanishka.website
MIT License

Testing word-by-word surprisal in languages other than English #3

Closed: matakahas closed this issue 3 years ago

matakahas commented 3 years ago

Hello,

Thank you for making this amazing work available! I am wondering if the code works for languages other than English. I was following your example and was able to load English and Japanese models without an error.

from minicons import scorer

model_jp = scorer.IncrementalLMScorer("colorfulscoop/gpt2-small-ja", 'cpu')  # Japanese GPT-2
model_en = scorer.IncrementalLMScorer("gpt2", 'cpu')  # English GPT-2

But when I ran the line model_jp.logprobs(model_jp.prepare_text([text])) with the Japanese model, it threw the following error:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in tokens(self, batch_index)
    293         """
    294         if not self._encodings:
--> 295             raise ValueError("tokens() is not available when using Python-based tokenizers")
    296         return self._encodings[batch_index].tokens
    297 

ValueError: tokens() is not available when using Python-based tokenizers

I would appreciate it if you could point me to any possible solutions, and I apologize if this is a very basic question. Thank you!

kanishkamisra commented 3 years ago

Hey! Thank you for using the library! Unfortunately, I think this might be a Hugging Face Transformers issue. However, if you send me a minimal example of the input, I can try to figure out what is going on. My best guess is that the model's tokenizer is not working properly or is buggy.
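If it helps to narrow things down, here is a quick diagnostic sketch (plain transformers, nothing minicons-specific, and just a guess at the cause): the tokens() call in the traceback is only available for the Rust-backed "fast" tokenizers, so checking the is_fast attribute would tell us whether this model ships a Python-based ("slow") tokenizer.

from transformers import AutoTokenizer

# tokens() in the traceback only works with "fast" (Rust-backed) tokenizers,
# so a Python-based ("slow") tokenizer would explain the ValueError above.
tok = AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja")
print(tok.is_fast)  # False => Python-based tokenizer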

kanishkamisra commented 3 years ago

Update: I would recommend upgrading the package. Run pip install --upgrade minicons and use model.compute_stats() instead of logprobs(). I am working on a documentation website and it should be finished soon. Apologies for my poor documentation skills!

With the above changes, I am able to run the model properly. I used this example (I am unsure what the text means; I pasted it from https://huggingface.co/colorfulscoop/gpt2-small-ja):

from minicons import scorer

model_jp = scorer.IncrementalLMScorer('colorfulscoop/gpt2-small-ja')
text = '統計的機械学習でのニューラルネットワーク'

model_jp.compute_stats(model_jp.prepare_text(text))
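Since this issue is about word-by-word surprisal, here is a minimal follow-up sketch for turning that output into surprisals, assuming compute_stats returns one list of per-token log probabilities per input sentence (check the output of your installed version). The text above roughly translates to "neural networks in statistical machine learning".

log_probs = model_jp.compute_stats(model_jp.prepare_text(text))[0]
surprisals = [-lp for lp in log_probs]  # surprisal = -log p(token | preceding context)
print(surprisals)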

I have updated the [example](https://github.com/kanishkamisra/minicons/blob/master/examples/surprisals.md) with the newer changes. Thanks again for posting an issue!

matakahas commented 3 years ago

Thank you so much for your quick response! Yes, the code worked in my environment as well. [Screenshot of the working output attached.]

Is there a way for the model.compute_stats() function to also return how the sentence was tokenized, like in the model.logprobs() function? I will see if I can modify the code to print that out, but please let me know if there is an easy fix. Again, thank you for taking the time to answer my question!

kanishkamisra commented 3 years ago

Oh yes, you can do that with model.token_score(text); there is no need to prepare the text, as it is a convenient wrapper around compute_stats and prepare_text.
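For example, with the Japanese model from above (a minimal sketch; the exact format of the returned token/score pairs may vary slightly across minicons versions):

from minicons import scorer

model_jp = scorer.IncrementalLMScorer('colorfulscoop/gpt2-small-ja')
text = '統計的機械学習でのニューラルネットワーク'

# token_score pairs each token with its score, so the tokenization and the
# per-token values come back in a single call.
for token, score in model_jp.token_score(text)[0]:
    print(token, score)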

I have also updated my tutorial: https://github.com/kanishkamisra/minicons/blob/master/examples/surprisals.md