sharder996 / nagoya-obi

Python port of the Nagoya Obi reading-difficulty statistic by Satoshi Sato

Question about the likelihood computation #1

Open nilaykumar opened 1 year ago

nilaykumar commented 1 year ago

Hey Scott --

First of all: thanks for your work on this port. The web version of the tool seems to be down, so this repo was actually the only place I could find code/data files for Sato et al.'s project. This isn't really a question about the code, so I apologize if this is a bit off-topic. I just figured you might have an idea of how the computation is set up.

In the `calculate_likelihoods` function of `nagoyaobi.py`, the likelihood of a given character key of the input text being in the ith grade level is computed as `text[key][i] = text[key][0] * self.model[key][i]`. This is in line with the summand in equation (8) of Sato-Matsuyoshi-Kondoh's paper, so no problems there. The second factor in this product should be $\log P(z \mid G_i)$, which, being the log of a probability, should be non-positive. The values in the `self.model` dictionary, however, are pulled from `Obi2-T13.model`, which has a number of positive values. So it's not clear to me that the code (even the Ruby code) is doing exactly what is described in the paper. Am I misunderstanding how the computation's supposed to go?
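To make the concern concrete, here is a minimal sketch of the computation as I read equation (8). The function name `grade_log_likelihoods` is mine, not the repo's, and I'm assuming `model` maps each character to a list of per-grade values that are supposed to be $\log P(z \mid G_i)$:

```python
import math
from collections import Counter

def grade_log_likelihoods(text, model, num_grades):
    """Naive-Bayes-style score of `text` against each grade level,
    as I read equation (8): sum over characters z of
    freq(z) * log P(z | G_i).  `model` maps each character to a
    list of per-grade values, assumed to be log-probabilities."""
    counts = Counter(text)
    scores = [0.0] * num_grades
    for ch, freq in counts.items():
        if ch not in model:
            continue  # skip unseen characters (one possible smoothing choice)
        for i in range(num_grades):
            scores[i] += freq * model[ch][i]
    return scores

# Toy example with genuine log-probabilities (every value <= 0):
toy = {'a': [math.log(0.9), math.log(0.1)],
       'b': [math.log(0.1), math.log(0.9)]}
print(grade_log_likelihoods('aab', toy, num_grades=2))
```

If the model values really are log-probabilities, every score here is a sum of non-positive terms, which is why the positive entries in `Obi2-T13.model` surprised me.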

sharder996 commented 1 year ago

Hi Nilay, thanks for your interest in my work!

Yes, you are correct that the log of a probability should always be non-positive! Unfortunately, I did not create the model myself; I data-mined it from the original project, so I can't speak to why some of the values are positive.
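If it helps, one quick way to audit the model is to flag every entry that cannot be a log-probability. This is only a sketch: `find_positive_entries` is a name I'm making up here, and it assumes the model has already been loaded into the same character-to-per-grade-values dict that `nagoyaobi.py` builds:

```python
def find_positive_entries(model):
    """Return, for each character with at least one positive value,
    the (grade_index, value) pairs that are > 0.  A true
    log-probability is log(p) for p in (0, 1], so it is always <= 0;
    anything positive here can't be one."""
    return {ch: [(i, v) for i, v in enumerate(vals) if v > 0]
            for ch, vals in model.items()
            if any(v > 0 for v in vals)}

# Toy example: one suspicious value in grade index 1 for 'a'.
toy = {'a': [-1.0, 0.5], 'b': [-2.0, -0.1]}
print(find_positive_entries(toy))
```

Running something like this over the loaded `Obi2-T13.model` data would at least show how many entries are affected and whether they cluster in particular grades.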

This project was simply for the benefit of my own Japanese study with the intent of rating the reading difficulty of ebooks. However, if you have any improvements to what I've started here feel free to let me know!