howisonlab / software-mentions-dataset-analysis

Analyses of software mentions and dependencies
GNU General Public License v3.0

Are documentAttributes confidence values identical across software-name.normalizedForm? #15

Open willbeason opened 1 week ago

willbeason commented 1 week ago

Every software .mentions[] entry has several subfields under documentContextAttributes. We suspect these are unique to the pair (document, software-name.normalizedForm) rather than to (document, mention index). If true, we would need a separate table for them.

willbeason commented 1 week ago

Running analysis now - should take ~2 hours.

So far it looks like this is the case (~2% sample). My application terminates the first time it sees two different documentContextAttributes scores for the same software-name.normalizedForm within a single document.
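For anyone who wants to reproduce the check, here is a minimal Python sketch of the fail-fast idea. It is not the application used for the run above. It assumes each mention is a dict with a `software-name.normalizedForm` string and a `documentContextAttributes` mapping whose subfields each carry a numeric `score`; the names `check_document` and `ScoreMismatch` are hypothetical.

```python
from typing import Any


class ScoreMismatch(Exception):
    """Raised when one software name carries two different scores in a document."""


def check_document(doc_name: str, mentions: list[dict[str, Any]]) -> None:
    """Raise ScoreMismatch on the first inconsistent score within one document.

    Assumed (not confirmed) mention layout:
      {"software-name": {"normalizedForm": str},
       "documentContextAttributes": {<attr>: {"score": float}, ...}}
    """
    # Maps (normalizedForm, attribute name) -> first score seen in this document.
    seen: dict[tuple[str, str], float] = {}
    for mention in mentions:
        name = mention["software-name"]["normalizedForm"]
        attrs = mention.get("documentContextAttributes", {})
        for attr, payload in attrs.items():
            score = payload["score"]
            key = (name, attr)
            if key in seen and seen[key] != score:
                raise ScoreMismatch(
                    f'processing "{doc_name}": {attr} score mismatch '
                    f'for "{name}": {seen[key]:.6f} != {score:.6f}'
                )
            seen.setdefault(key, score)
```

Scores are compared with exact equality here; if the values are expected to differ only by rounding, a tolerance such as `math.isclose` may be more appropriate.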

willbeason commented 1 week ago

And nope, they are not always the same:

processing "0bb12cfe-2e19-423a-bb7e-4ab64ef48647.pdf": shared score mismatch for "STRATA1": 0.000122 != 0.000001

willbeason commented 1 week ago

Rerunning now to see how prevalent this is. So far I've found 2 instances in the first 5% of the data.
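To measure prevalence rather than stop at the first hit, the check can collect every mismatch instead. The sketch below is only an illustration under the same assumed mention layout (a `software-name.normalizedForm` string plus `documentContextAttributes` subfields that each carry a `score`). The function name `find_mismatches` is hypothetical, and counting one mismatch per (document, name, attribute) triple is an assumption about how to tally them.

```python
from collections import defaultdict
from typing import Any, Iterable


def find_mismatches(
    documents: Iterable[tuple[str, list[dict[str, Any]]]],
) -> list[tuple[str, str, str, list[float]]]:
    """Return (doc_name, normalizedForm, attribute, distinct scores) per mismatch.

    `documents` yields (doc_name, mentions) pairs; the mention layout is assumed.
    """
    mismatches = []
    for doc_name, mentions in documents:
        # Distinct scores seen for each (normalizedForm, attribute) in this document.
        scores: dict[tuple[str, str], set[float]] = defaultdict(set)
        for mention in mentions:
            name = mention["software-name"]["normalizedForm"]
            attrs = mention.get("documentContextAttributes", {})
            for attr, payload in attrs.items():
                scores[(name, attr)].add(payload["score"])
        for (name, attr), values in scores.items():
            if len(values) > 1:
                mismatches.append((doc_name, name, attr, sorted(values)))
    return mismatches
```

Calling `len(find_mismatches(docs))` then gives the number of inconsistent (document, name, attribute) triples across the corpus.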

willbeason commented 1 week ago

Found a total of 56 cases where the scores don't match:

https://gist.github.com/willbeason/1045b219e8537ec1ba119b57c61a58bf