Open willbeason opened 1 week ago
Running analysis now - should take ~2 hours.
So far it looks like this is the case (~2% sample) - my application will terminate the first time it sees two different documentContextAttributes score values for the same software-name.normalizedForm in the same document.
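For anyone who wants to reproduce the check, here is a minimal sketch of the terminate-on-first-mismatch logic. It is not the code I actually ran. It assumes each mention is a dict with a "software-name" object holding "normalizedForm" and a "documentContextAttributes" object whose subfields each carry a "score"; the names check_document and ScoreMismatch are made up for illustration.

```python
class ScoreMismatch(Exception):
    """Raised when one software name has two different scores in a document."""


def check_document(doc_name, mentions):
    # Assumed layout (not confirmed by the dataset docs):
    #   mention["software-name"]["normalizedForm"] -> str
    #   mention["documentContextAttributes"][attr]["score"] -> float
    seen = {}  # (normalized name, attribute) -> first score observed
    for mention in mentions:
        name = mention["software-name"]["normalizedForm"]
        for attr, fields in mention["documentContextAttributes"].items():
            score = fields["score"]
            key = (name, attr)
            if key in seen and seen[key] != score:
                # Stop at the first disagreement within this document.
                raise ScoreMismatch(
                    f'processing "{doc_name}": {attr} score mismatch '
                    f'for "{name}": {seen[key]:f} != {score:f}'
                )
            seen.setdefault(key, score)
    return seen
```

Scores are compared for exact equality here, so if the values were ever recomputed rather than copied, a tolerance-based comparison might be the better choice.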
And nope, not the same always:
processing "0bb12cfe-2e19-423a-bb7e-4ab64ef48647.pdf": shared score mismatch for "STRATA1": 0.000122 != 0.000001
Rerunning now to see how prevalent this is. So far I've found 2 instances in the first 5% of the data.
Found a total of 56 times where the scores don't match:
https://gist.github.com/willbeason/1045b219e8537ec1ba119b57c61a58bf
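To measure prevalence instead of stopping at the first hit, the check can collect every distinct score per (document, normalized name, attribute) and report the keys with more than one. This is a rough sketch under the same assumed mention layout ("software-name" with "normalizedForm", and "documentContextAttributes" subfields that each have a "score"); find_mismatches is a hypothetical name, and this is not necessarily how the linked list was produced.

```python
from collections import defaultdict


def find_mismatches(documents):
    """Return keys whose score is not unique within a document.

    documents: mapping of document name -> list of mention dicts
    (assumed layout, see the note above).
    """
    scores = defaultdict(set)  # (doc, normalized name, attribute) -> scores seen
    for doc_name, mentions in documents.items():
        for mention in mentions:
            name = mention["software-name"]["normalizedForm"]
            for attr, fields in mention["documentContextAttributes"].items():
                scores[(doc_name, name, attr)].add(fields["score"])
    # Keep only the keys where more than one distinct score was observed.
    return {key: sorted(vals) for key, vals in scores.items() if len(vals) > 1}
```

The length of the returned dict is the number of (document, name, attribute) combinations that disagree, which is the count that matters for deciding whether these fields can live in a per-(document, name) table.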
Every software .mentions[] entry has several subfields under documentContextAttributes. We suspect these are unique to the pair of (document, software-name.normalizedForm) rather than specific to (document, mention index). This would mean we need a separate table for these, if true.