Out of Sample Prediction for Wordfish

quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Out of Sample Prediction for Wordfish #5

Open muhark opened 5 years ago

muhark commented 5 years ago

OOS Predict Wordfish

Hi, two (related) questions. The documentation for textmodel_wordfish says that out-of-sample prediction is not currently supported; does this mean that the feature may be added in the future? If so, I'd be happy to submit a pull request or get involved in implementing it. Second question: does fitting wordfish require that there are no features with zero occurrences? I imagine that if we can fit the model with the union of both corpora (the training and prediction sets), then could it be that the prediction task would be as trivial as "plugging in" beta and psi then recovering theta (and alpha) accordingly?

Sorry if this question has been asked before/if it's silly; I've only just started working with quanteda.

kbenoit commented 5 years ago

Wordfish cannot fit zero-occurrence features, as they are undefined. (Otherwise there would be an infinite number of zero-occurrence features that could be counted.) But an OOS prediction method could do the same as the predict methods for other textmodel_*() functions and make the newdata dfm conform to the dfm used to fit the model. (This would mean treating features present in x but not in newdata as occurring zero times in newdata, and dropping features present in newdata but not in x.)
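To make the conforming step concrete, here is a minimal sketch in plain Python rather than quanteda. The function name `conform_counts` and the toy feature list are made up for illustration; this is not how any predict method in the package is actually written. In quanteda itself, `dfm_match()` looks like the natural tool for the same job.

```python
from collections import Counter

def conform_counts(new_counts, model_features):
    """Align a new document's counts to the fitted model's feature order.

    Features the model knows but the document lacks get a count of 0;
    features the model never saw are dropped.
    """
    return [new_counts.get(feature, 0) for feature in model_features]

# Hypothetical feature set of the fitted model (x) and one new document.
model_features = ["tax", "welfare", "border", "climate"]
new_doc = Counter("tax tax climate subsidy".split())

aligned = conform_counts(new_doc, model_features)
print(aligned)  # [2, 0, 0, 1]  ("subsidy" is dropped)
```

Note that the zero counts only appear in newdata. The fitted model never has to estimate parameters for a feature it did not observe.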

I think @conjugateprior has already implemented a predict method for wordfish. Thoughts, Will?

conjugateprior commented 5 years ago

Getting point estimates might well be 'as trivial as "plugging in" beta and psi then recovering theta (and alpha) accordingly', although you'd probably want to flip to multinomial form first.
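A rough sketch of what the plug-in step could look like, in plain Python and not taken from any existing implementation. It assumes the usual wordfish parameterisation, with counts y_j ~ Poisson(exp(alpha + psi_j + beta_j * theta)). Conditioning on document length gives the multinomial form, in which alpha drops out and theta is found by Newton steps on a one-dimensional concave log-likelihood. The function name, the toy parameter values, and the undamped Newton update are all illustrative assumptions.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def estimate_theta_alpha(counts, psi, beta, iters=100, tol=1e-10):
    """Plug in fixed psi/beta and recover theta (multinomial form), then alpha.

    Assumes the betas are not all equal and the document is not made up
    solely of the most extreme-beta feature; otherwise no finite MLE exists.
    """
    n = sum(counts)
    theta = 0.0
    for _ in range(iters):
        eta = [p + b * theta for p, b in zip(psi, beta)]
        lse = log_sum_exp(eta)
        probs = [math.exp(e - lse) for e in eta]
        mean_b = sum(p * b for p, b in zip(probs, beta))
        var_b = sum(p * (b - mean_b) ** 2 for p, b in zip(probs, beta))
        # Gradient is sum(y*beta) - n*E[beta]; Hessian is -n*Var[beta].
        grad = sum(y * b for y, b in zip(counts, beta)) - n * mean_b
        step = grad / (n * var_b)
        theta += step
        if abs(step) < tol:
            break
    # Poisson MLE of alpha given theta: total count equals total expected count.
    eta = [p + b * theta for p, b in zip(psi, beta)]
    alpha = math.log(n) - log_sum_exp(eta)
    return theta, alpha

# Toy check: feed in exact expected counts for theta = 0.7, alpha = 2.0.
psi = [0.0, 0.5, -0.5, 0.2]
beta = [-1.0, 0.5, 1.0, -0.3]
expected = [math.exp(2.0 + p + b * 0.7) for p, b in zip(psi, beta)]
theta_hat, alpha_hat = estimate_theta_alpha(expected, psi, beta)
print(theta_hat, alpha_hat)
```

Because the toy counts are exact expected values rather than a sampled document, the estimates should land on the generating values up to numerical tolerance. Real documents would of course give noisier answers.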

There's a bit more work to be done to get prediction intervals, though. The easiest asymptotic standard errors for item parameters would assume that the ideal points are perfectly measured (which seems not altogether unreasonable for very large vocabularies) and could be constructed numerically. You could then sample from them to get uncertainty around the ideal points for OOS docs.
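One way the sampling idea might be sketched, again in plain Python and under strong simplifying assumptions: the item-parameter standard errors (`psi_se`, `beta_se`) are hypothetical inputs that would have to come from the fitted model, draws are taken from independent normals (ignoring any covariance between psi and beta), and only item-parameter uncertainty is propagated, not sampling noise in the new document's own counts.

```python
import math
import random

def fit_theta(counts, psi, beta, iters=100):
    """Newton steps on the multinomial log-likelihood in theta, items held fixed."""
    n, theta = sum(counts), 0.0
    for _ in range(iters):
        eta = [p + b * theta for p, b in zip(psi, beta)]
        top = max(eta)
        w = [math.exp(e - top) for e in eta]
        z = sum(w)
        mean_b = sum(x * b for x, b in zip(w, beta)) / z
        var_b = sum(x * (b - mean_b) ** 2 for x, b in zip(w, beta)) / z
        step = (sum(y * b for y, b in zip(counts, beta)) - n * mean_b) / (n * var_b)
        theta += step
        if abs(step) < 1e-10:
            break
    return theta

def theta_interval(counts, psi, beta, psi_se, beta_se, draws=500, seed=1):
    """Percentile interval for theta from simulated item parameters.

    Assumption: independent normal draws around the point estimates.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        psi_d = [rng.gauss(p, s) for p, s in zip(psi, psi_se)]
        beta_d = [rng.gauss(b, s) for b, s in zip(beta, beta_se)]
        sims.append(fit_theta(counts, psi_d, beta_d))
    sims.sort()
    return sims[int(0.025 * (draws - 1))], sims[int(0.975 * (draws - 1))]

# Toy item parameters and made-up standard errors for one OOS document.
psi = [0.0, 0.5, -0.5, 0.2]
beta = [-1.0, 0.5, 1.0, -0.3]
se = [0.05, 0.05, 0.05, 0.05]
doc = [10, 46, 24, 20]

point = fit_theta(doc, psi, beta)
lower, upper = theta_interval(doc, psi, beta, se, se)
print(point, lower, upper)
```

A fuller treatment would draw from the joint asymptotic distribution of the item parameters and probably add the document-level sampling variance as well.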

The main practical issue would be deciding what to do if the two dfms had been preprocessed differently.