Open lmullen opened 2 years ago
Additionally, she calculated that eyecite found nearly a million cases we have not matched so far. The table of cites is attached here and I'll start going through it to figure out why we missed these. At a cursory glance, I would guess most of these simply aren't on our whitelist, like "Wash. C. C.", which appears in this table over 17,000 times.
The tables are too big to attach. They can be found here: https://drive.google.com/drive/folders/19l8aVcdVPbZjUqqymVNDOm8fshel1gJf?usp=sharing
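One way to check the whitelist guess is to tally reporter abbreviations among the unmatched cites and see which ones are missing from the whitelist. This is only a rough sketch: it assumes each cite is a plain "volume reporter page" string, and `CITE_RE`, `tally_unlisted_reporters`, and the sample whitelist are hypothetical names, not anything from our pipeline or from eyecite.

```python
import re
from collections import Counter

# Assumption: cites look like "<volume> <reporter> <page>", e.g. "1 Wash. C. C. 123".
CITE_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s*$")


def tally_unlisted_reporters(cites, whitelist):
    """Count reporter abbreviations that are not on the whitelist.

    Returns (reporter, count) pairs, most frequent first.
    Cites that do not fit the assumed pattern are skipped.
    """
    counts = Counter()
    for cite in cites:
        match = CITE_RE.match(cite)
        if match is None:
            continue
        reporter = match.group(2)
        if reporter not in whitelist:
            counts[reporter] += 1
    return counts.most_common()


# Toy data, not real results.
sample_cites = ["1 Wash. C. C. 123", "2 Wash. C. C. 45", "12 U.S. 45", "not a cite"]
sample_whitelist = {"U.S."}
missing = tally_unlisted_reporters(sample_cites, sample_whitelist)
```

Reporters near the top of that list would be the obvious candidates to add; anything that shows up often but is already whitelisted would point to a more systematic problem.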
I think we can easily add more things to the whitelist if need be. I would be curious to know if there are more systematic problems, however.
What about the inverse question? Are there cases that we found and eyecite did not? If so, how many?
But really, what this points to is just creating a union of eyecite cases plus whitelist cases, which will be better than either method individually. We don't really care how we get there, as long as the cases are known to be good.
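A minimal sketch of that union, assuming both methods can be reduced to (treatise, case) pairs with cites normalized the same way on both sides. `Edge`, `compare_edges`, and the IDs below are hypothetical, made up for illustration. Putting the pairs in sets also deduplicates repeated cites to the same case by the same treatise.

```python
from typing import Iterable, NamedTuple


class Edge(NamedTuple):
    """One treatise citing one case (hypothetical field names)."""
    treatise_id: str
    case_cite: str


def compare_edges(whitelist_edges: Iterable[Edge], eyecite_edges: Iterable[Edge]) -> dict:
    """Return the union of both edge sets plus the pieces that explain it."""
    ours = set(whitelist_edges)    # set() drops duplicate edges
    theirs = set(eyecite_edges)
    return {
        "union": ours | theirs,            # everything either method found
        "both": ours & theirs,             # found by both methods
        "whitelist_only": ours - theirs,   # we found, eyecite did not
        "eyecite_only": theirs - ours,     # eyecite found, we did not
    }


# Toy data, not real results.
ours = [Edge("t1", "12 U.S. 45"), Edge("t1", "12 U.S. 45"), Edge("t2", "3 Mass. 10")]
theirs = [Edge("t1", "12 U.S. 45"), Edge("t3", "1 Wash. C. C. 123")]
result = compare_edges(ours, theirs)
```

The `whitelist_only` and `eyecite_only` sets answer the question in both directions, and `union` is the combined edge list.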
See the first comment above. So far, she thinks we found 4 million cites that eyecite didn't.
My RA believes she has successfully run eyecite across the MOML corpus. Deduplicating cites to the same case by the same treatise, she found around 4 million edges, about half what we found through our whitelisting method (I think our number is 8.2 million). I'll upload her table below.