Open markmatney opened 6 years ago
This is a great writeup of the problem, thank you! I'll take a look into this, though I don't have an ETA for it. If you're using this in something where time is of the essence, let me know -- and I'd welcome a pull request with a fix, if you have one.
Weird. Seems like the regex here would only allow USC or U.S.C.:
https://github.com/unitedstates/citation/blob/master/citations/usc.js#L51
Is it being picked up as a U.S.C. citation?
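To make the question concrete, here is a minimal sketch of a case-sensitive pattern that accepts only "USC" or "U.S.C." between a title and a section number. The pattern and the helper name `looksLikeUsc` are hypothetical, written for illustration; this is not the regex from `citations/usc.js`.

```javascript
// Hypothetical pattern (NOT the one in citations/usc.js):
// title number, then "USC" or "U.S.C.", then an optional section sign and a section number.
const USC_LIKE = /\b(\d+)\s+(?:USC|U\.S\.C\.)\s+(?:§\s*)?(\d+\w*)/;

// Returns true when the text contains something shaped like a US Code citation.
function looksLikeUsc(text) {
  return USC_LIKE.test(text);
}

console.log(looksLikeUsc("5 U.S.C. 552")); // true
console.log(looksLikeUsc("5 USC 552"));    // true
console.log(looksLikeUsc("5 use 552"));    // false: lowercase "use" is not "USC"
```

Under a pattern like this, an OCR'd "use" would not be read as a US Code citation, which is consistent with it being picked up by some other citation type instead.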
@konklone not time sensitive for me. I'm happy to collaborate on this issue though. I wouldn't be able to take it on entirely myself (not too knowledgeable about law and legal citations) but I am quite good with regular expressions.
@mlissner each of the examples I mentioned is being interpreted as a citation of type reporter.
I am doing some experiments on some of the searchable "Statutes at Large" PDFs on FDsys. The text layer presumably contains raw OCR output, since it contains a lot of errors. I am extracting the text layer and sending it to cite-server running locally. The following code snippets return false positives:
I am seeing the first case ("use") often, because US Code citations in historical documents frequently omit the periods in the abbreviation "USC" (see https://www.gpo.gov/fdsys/pkg/STATUTE-70/content-detail.html, open the PDF, search for the string "use", and see it highlighted often in the margins). I think the OCR engine that generated the text guessed "use", a word more common in everyday English than "USC". (To be clear, I'm NOT suggesting that it is the responsibility of the citation finder to anticipate and fix things like OCR errors.)
The last case has been popping up every once in a while, where you have a single word in between two numbers (see https://github.com/unitedstates/citation/issues/100).
Generally, the issue seems to be that citations of the reporter type are not being properly validated before being returned to the caller of Citation.find.
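As a rough illustration of the kind of validation meant here, the sketch below drops reporter-type citations whose reporter abbreviation is not in a known list. Everything in it is an assumption made for the example: the allowlist contents, the function name `filterReporterCitations`, the shape of the citation objects, and the sample inputs. It is a sketch of one possible approach, not the library's implementation.

```javascript
// Hypothetical allowlist; a real one would come from a full reporters database.
const KNOWN_REPORTERS = new Set(["U.S.", "F.2d", "F.3d", "S. Ct."]);

// Assumed citation shape: { type: "reporter", reporter: { volume, reporter, page } }.
// Citations of any other type pass through unchanged.
function filterReporterCitations(citations) {
  return citations.filter((cite) => {
    if (cite.type !== "reporter") return true;
    return KNOWN_REPORTERS.has(cite.reporter.reporter);
  });
}

// Made-up sample results, shaped like the assumed output of a citation finder.
const sample = [
  { type: "reporter", reporter: { volume: "5", reporter: "use", page: "552" } },
  { type: "reporter", reporter: { volume: "410", reporter: "U.S.", page: "113" } },
  { type: "usc", usc: { title: "5", section: "552" } },
];

const kept = filterReporterCitations(sample);
console.log(kept.length); // 2: the "use" match is dropped
```

An allowlist trades recall for precision: unknown but legitimate reporters would be dropped too, so the list would need to be reasonably complete before a check like this could run by default.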