Hi Tony,
I doubled the size of the training set and improved the feature set. The logistic scorer now seems to clearly outperform the numDerivations baseline, at least according to cross-validation on the training data (see the Google Doc here for before/after/baseline curves). The results look pretty good subjectively, too.
I added a feature that issues a count query to the triplestore to gauge how common an argument is; this seemed to provide a large boost. All of the features are still very general, and I think there is even more to be gained by looking at relation overlap between the query and the results, which I'll look into next. In the meantime, the scorer seems good enough to include in the WebRepl, so I'm sending this pull request. I also put up a demo with the latest classifier at kupo:8080 if you want to check it out.
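For concreteness, the count feature boils down to something like the sketch below. The endpoint URL, entity URI, query shape, and log scaling here are stand-ins to show the idea, not the actual code in the pull request:

// Sketch of the count-based popularity feature. The endpoint, URI, and
// query shape are illustrative placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class EntityPopularity {
  // Ask the triplestore how many triples mention the entity, as a rough
  // proxy for how common the argument is.
  public static long countMentions(String endpoint, String entityUri) throws Exception {
    String sparql =
        "SELECT (COUNT(*) AS ?c) WHERE { "
            + "{ <" + entityUri + "> ?p ?o } UNION { ?s ?p <" + entityUri + "> } }";
    URL url = new URL(endpoint + "?query=" + URLEncoder.encode(sparql, "UTF-8"));
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "text/csv");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      in.readLine();                              // skip CSV header ("c")
      return Long.parseLong(in.readLine().trim());
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint and entity; the real values come from the config.
    long n = countMentions("http://localhost:3093/sparql",
                           "http://rdf.freebase.com/ns/en.barack_obama");
    // Log-scale the raw count so it behaves sensibly as a feature value.
    double feature = Math.log(1 + n);
    System.out.println("count=" + n + " feature=" + feature);
  }
}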