mikeizbicki / modulus-magnus-linguae


Latin BERT pseudo perplexity difficulties #77

Open alysawyer opened 11 months ago

alysawyer commented 11 months ago

Hi! Based on how the Latin BERT code is written, I think I might have to implement the pseudo-perplexity myself. The tokenizer they use isn't derived from a more generic tokenizer; it's something they created themselves, so the generic code I found doesn't work because the tokenizer and model have different inputs and outputs than expected. I think it's possible to get the pseudo-perplexity, but it might take some time since the Latin BERT code isn't well documented and I first need to understand it. Please let me know if you have any ideas or if I'm missing anything.

I also think it would be worth looking into taking advantage of Latin BERT's existing features (e.g. using text infilling to get the probabilities of the candidate words in the sentence and then scoring based on where the answer ranks relative to the other words in the word bank). However, if not every word in the word bank is present in Latin BERT's vocabulary, that might require some additional fine-tuning or create other issues.

mikeizbicki commented 11 months ago

Based on how the Latin BERT code is written, I think I might have to implement the pseudo-perplexity myself. The tokenizer they use isn't derived from a more generic tokenizer; it's something they created themselves, so the generic code I found doesn't work because the tokenizer and model have different inputs and outputs than expected. I think it's possible to get the pseudo-perplexity, but it might take some time since the Latin BERT code isn't well documented and I first need to understand it. Please let me know if you have any ideas or if I'm missing anything.

This makes sense, but you should probably be able to find a generic pseudo-perplexity implementation that lets you specify the tokenizer.

I don't know if this will work for the LatinBERT use case, but I found this SO question with a pseudo-perplexity implementation that lets you specify the tokenizer: https://stackoverflow.com/questions/70464428/how-to-calculate-perplexity-of-a-sentence-using-huggingface-masked-language-mode
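
For reference, the core of that SO approach is roughly the following (a minimal sketch assuming a HuggingFace-style masked LM and tokenizer; `model` and `tokenizer` here are placeholders, not LatinBERT's actual objects):

```python
import torch

def pseudo_perplexity(sentence, model, tokenizer):
    """Mask each token in turn, score the true token under the masked LM,
    and exponentiate the negative average log-probability."""
    input_ids = tokenizer.encode(sentence, return_tensors='pt')
    n = input_ids.size(1)
    total_log_prob = 0.0
    model.eval()
    with torch.no_grad():
        for i in range(1, n - 1):                    # skip [CLS] and [SEP]
            masked = input_ids.clone()
            true_id = masked[0, i].item()
            masked[0, i] = tokenizer.mask_token_id
            logits = model(masked).logits            # (1, n, vocab_size)
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            total_log_prob += log_probs[true_id].item()
    return float(torch.exp(torch.tensor(-total_log_prob / (n - 2))))
```

The only LatinBERT-specific parts are the `.encode()` call and `mask_token_id`, which is exactly where the tokenizer mismatch you describe shows up.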

I also think it would be worth looking into taking advantage of Latin BERT's existing features (e.g. using text infilling to get the probabilities of the candidate words in the sentence and then scoring based on where the answer ranks relative to the other words in the word bank). However, if not every word in the word bank is present in Latin BERT's vocabulary, that might require some additional fine-tuning or create other issues.

The main disadvantage of this approach is that most words will consist of multiple tokens. (For example, the word linguae is likely to be broken into something like lingu and ae: one token for the root and one for the grammatical ending.) You'll have to account for this, which isn't too bad, but once you're accounting for it you've basically implemented the pseudo-perplexity yourself.
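
To make the multi-token issue concrete, one way to handle the infilling/ranking idea is to mask one position per subword piece of the candidate and sum the log-probabilities of its pieces. This is only a sketch under the same HuggingFace-style assumptions as above; `score_candidate` and the `[BLANK]` placeholder are hypothetical, not anything in the LatinBERT code:

```python
import torch

def score_candidate(sentence_with_blank, candidate, model, tokenizer, blank='[BLANK]'):
    """Score a word-bank candidate for a fill-in-the-blank sentence by masking
    every subword piece of the candidate and summing the pieces' log-probs."""
    # The candidate may split into several subword tokens (e.g. 'lingu' + 'ae').
    cand_ids = tokenizer.encode(candidate, add_special_tokens=False)
    # Replace the blank with one [MASK] per subword piece of the candidate.
    masked_text = sentence_with_blank.replace(
        blank, ' '.join([tokenizer.mask_token] * len(cand_ids)))
    input_ids = tokenizer.encode(masked_text, return_tensors='pt')
    mask_positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    model.eval()
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits[0], dim=-1)
    # Sum the log-probability of each piece at its masked position.
    return sum(log_probs[pos, tid].item()
               for pos, tid in zip(mask_positions, cand_ids))
```

Ranking the word bank by this score gives the ordering described above; normalizing by the number of subword pieces is one way to avoid penalizing longer words.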

alysawyer commented 11 months ago

Ok, that makes sense, thanks! I'll start by looking into the first approach, and if that doesn't work I'll try the second.

alysawyer commented 11 months ago

Hi -- just started looking into this, and I realized the link you sent is the same SO answer I was referencing when I first tried to get this working. I don't think it works, for the reason I mentioned above: where the function says tokenizer, it expects an object with specific inputs and outputs that the Latin BERT tokenizer doesn't provide. I tried to reverse engineer how the Latin BERT tokenizer works so I could use it with that function, but that wasn't really working and was taking a long time.

alysawyer commented 11 months ago

Do you have any time to meet with me individually this week or early next? I think I'm missing some conceptual understanding of how the Latin BERT code works, which is slowing me down.

mikeizbicki commented 11 months ago

I'm on zoom right now (not sure how long I'll stay on) while I'm looking at this a bit more. Otherwise I could meet at 1pm tomorrow.

mikeizbicki commented 11 months ago

My quick glance through the code makes it look like the tokenization is happening at https://github.com/dbamman/latin-bert/blob/62bcb3133055da38d3317e095c276de696bfdad0/scripts/gen_berts.py#L219

There are two problems with this code that prevent it from being used directly with the SO code:

The simple problem is that the function is named differently from what the SO code expects.

The more complex problem is that I believe this returns a plain list of token IDs rather than the numpy array the SO code expects. To fix this, we need to dig into where the latinbert code calls their .tokenize() function and how they generate the tensors before passing them to the pytorch code.

The method is called in a handful of places, but I believe the .get_batches method is the important one here: https://github.com/dbamman/latin-bert/blob/62bcb3133055da38d3317e095c276de696bfdad0/scripts/gen_berts.py#L26C20-L26C29

This function seems to return the actual pytorch tensors that will be used in the training/eval loops, which is basically what the SO function expects its .encode call to produce. Unfortunately, the types still aren't exactly the same, so some digging through the code to connect these two pieces would be required.
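
As a very rough sketch of the kind of adapter that might bridge the two (assuming, per the above, that .tokenize() returns a flat Python list of token IDs; the wrapper class and its mask_token_id argument are hypothetical):

```python
import torch

class TokenizerAdapter:
    """Hypothetical wrapper exposing the HuggingFace-style pieces that the
    SO snippet relies on, built on top of LatinBERT's own tokenizer."""

    def __init__(self, latin_tokenizer, mask_token_id):
        self.latin_tokenizer = latin_tokenizer
        # The [MASK] id would need to be looked up in LatinBERT's vocab.
        self.mask_token_id = mask_token_id

    def encode(self, text, return_tensors=None):
        # Assumes .tokenize() returns a flat list of token IDs, as discussed above.
        token_ids = self.latin_tokenizer.tokenize(text)
        if return_tensors == 'pt':
            # Add a batch dimension so the result matches what model(input_ids) expects.
            return torch.tensor([token_ids], dtype=torch.long)
        return token_ids
```

The real work would be checking that these IDs line up with what the model's forward pass actually expects (special tokens, padding, attention masks), which is what digging through .get_batches should reveal.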

This is an admittedly daunting task if you haven't worked with pytorch much. I'll look to see if I can think of an alternative.