howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

mention clustering analysis #570

Open · jameshowison opened this issue 5 years ago

jameshowison commented 5 years ago

On my end I've been exploring the impact of moving to a "small window" content analysis approach, rather than coding full papers. See https://github.com/howisonlab/softcite-dataset/blob/master/code/clustering_analysis.Rmd, which is compiled at https://github.com/howisonlab/softcite-dataset/blob/master/code/clustering_analysis.html (download and open as HTML if interested).

The pic here is the result of taking samples (of different sizes) from the list of software mentions (called seeds), then using different page windows (i.e. mentions on the same page, mentions in a 3-page window centered on the seed, etc.). This is basically a measure of how clustered the mentions are. If you look at the seed percentage of 0.25, you can see that (unsurprisingly) 0.25 of mentions are found just from the seeds (pages read == 0); adding in those on the same page as any seed, you jump to finding 50%; and if you read 3 pages you get to 60%. The reason you never reach 100%, even when expanding the window to the full length of the papers, is that if a paper didn't contain any of the randomly chosen seeds then that article is never "read", because there are no starting points.

(figure: mention_clustering, showing the fraction of mentions found vs. pages read for each seed percentage)
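To make the procedure concrete, here is a minimal sketch of the sampling logic described above (in Python for readability; the actual analysis is in the Rmd linked above, and the function name and data layout here are purely illustrative):

```python
import random
from collections import defaultdict

def window_recall(mentions, seed_fraction, pages_read, rng_seed=0):
    """Estimate the fraction of known mentions recovered by reading only a
    page window around a random sample of "seed" mentions.

    mentions      : list of (article_id, page_number) pairs, one per known mention
    seed_fraction : share of mentions drawn at random as starting points (seeds)
    pages_read    : width of the window centered on each seed's page
                    (0 = count only the seeds themselves, 1 = the seed's page,
                     3 = the seed's page plus one page either side, ...)
    """
    rng = random.Random(rng_seed)
    n = len(mentions)
    seed_idx = rng.sample(range(n), max(1, round(seed_fraction * n)))

    if pages_read == 0:
        return len(seed_idx) / n

    # Pages that get "read": the window around every seed's page, per article.
    # Articles containing no seed are never read at all, which is why the
    # curve plateaus below 100%.
    half = (pages_read - 1) // 2
    read_pages = defaultdict(set)
    for i in seed_idx:
        article, page = mentions[i]
        read_pages[article].update(range(page - half, page + half + 1))

    found = sum(1 for article, page in mentions if page in read_pages[article])
    return found / n
```

Sweeping `seed_fraction` and `pages_read` over the same grid of values would give curves of the kind shown above.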

I think this shows a fairly high false negative rate (about 40% of known mentions are missed), but it would rapidly increase the number of positive examples we have (and reduce the coding effort massively). What will be really telling (and definitely publishable) is to test the precision/recall effectiveness of a system trained on data obtained via this bootstrapping approach, i.e. perhaps finding and coding just 60% of mentions is enough to train a system to equivalent success as finding 100%.

So @kermitt2, would it be straightforward to train the model with the subsets found at each of these seed_percent and expand_window levels (or at least expand_window 0:5), and record the different recall/precision numbers?

Currently, for performance reasons, I don't keep a record of exactly which selections were found (just a count), but if I recorded a list of selections for each experiment, could you train using just those that were found and then record the performance of the model trained only on those? I know that would be tons of computation, because we'd be training and testing the model many times. How long does it take to train and test the models at the moment?
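For concreteness, the experiment loop I have in mind would look roughly like this; `selections_found` and `train_and_evaluate` are placeholders standing in for the recorded selection lists and for however your training/evaluation scripts get invoked, not real functions:

```python
import itertools
import json

# Placeholder hooks: in practice these would call the actual training and
# evaluation pipeline; the names and signatures here are illustrative only.
def selections_found(seed_percent, expand_window):
    """Return the list of selection IDs recovered in this experiment."""
    raise NotImplementedError

def train_and_evaluate(selection_ids):
    """Train a model on just these selections; return (precision, recall)."""
    raise NotImplementedError

seed_percents = [k / 20 for k in range(1, 21)]  # the 20 seed_percent steps
expand_windows = range(0, 6)                    # window sizes 0:5 to start with

results = []
for sp, win in itertools.product(seed_percents, expand_windows):
    ids = selections_found(sp, win)
    precision, recall = train_and_evaluate(ids)
    results.append({
        "seed_percent": sp,
        "expand_window": win,
        "n_selections": len(ids),
        "precision": precision,
        "recall": recall,
    })

with open("subset_training_results.json", "w") as f:
    json.dump(results, f, indent=2)
```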

kermitt2 commented 5 years ago

It should be doable. The current complete training time is around 3 hours for the CRF model, and a bit more than 1 hour for the Deep Learning models (if I remember correctly - I have a good GPU). For CRF, I can use more computing power.

I see that we have 20 steps for seed_percent and 9 window sizes, so that would be 180 trainings? In the worst case, that's approx. 20 days of training without moving to better CPUs, and without considering that each of these trainings would use considerably fewer training examples.
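(For the arithmetic: 20 × 9 = 180 trainings; at roughly 3 hours each for CRF that is about 540 hours of sequential compute, which the smaller training sets would reduce somewhat.)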

However, it's important to note that the current evaluations are not very reliable, in particular in absolute terms, simply because, like the training data, the evaluation data has a very low inter-annotator agreement rate. To carry out this kind of exercise, we need to reach a good level of IAA and consistency in the annotations. Another aspect is that the models/features need to be already well tuned, which also presupposes reliable training data.