unitedstates / citation

Legal citation extractor, via command line, JavaScript, or HTTP. See a live example at:
https://uslaw.link

lack of validation on returned case citations #142

Open markmatney opened 6 years ago

markmatney commented 6 years ago

I am running some experiments on the searchable "Statutes at Large" PDFs on FDsys. The text layer presumably contains raw OCR output, since it has a lot of errors. I am extracting the text layer and sending it to cite-server running locally.

The following code snippets return false positives:

Citation.find("pursuant to 5 use 552(a)(1)(E) and") // "use" instead of "usc"
Citation.find("pursuant to 5 GARBAGE 552(a)(1)(E) and")
Citation.find("The sum of 27 and 42 is a number between 68 and 70.") // two citations found!

I see the first case ("use") often, because US Code citations in historical documents frequently omit the periods in the abbreviation "USC" (see https://www.gpo.gov/fdsys/pkg/STATUTE-70/content-detail.html, open the PDF, search for the string "use", and see it highlighted often in the margins). I think the OCR engine that generated the text guessed "use", a word more common in everyday English than "USC". (To be clear, I'm NOT suggesting that it is the responsibility of the citation finder to anticipate and fix things like OCR errors.)

The last case has been popping up every once in a while, where you have a single word in between two numbers (see https://github.com/unitedstates/citation/issues/100).

Generally, the issue seems to be that citations of the reporter type are not being properly validated before being returned to the caller of Citation.find.
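To illustrate the kind of validation meant here, below is a minimal sketch of a post-filter that rejects reporter-style matches whose middle token is not a known reporter abbreviation. Everything in it is an assumption for illustration: the candidate shape ({volume, reporter, page}), the KNOWN_REPORTERS allowlist, and the function names are hypothetical and are not taken from the library's code.

```javascript
// Hypothetical allowlist of reporter abbreviations; a real list would be far longer.
const KNOWN_REPORTERS = new Set(["U.S.", "F.2d", "F.3d", "F. Supp.", "S. Ct.", "Stat."]);

// Assumed candidate shape: { volume, reporter, page }, all strings.
// Returns true only when volume and page are positive integers and
// the reporter token appears in the allowlist.
function isPlausibleReporterCitation(candidate) {
  const volume = Number(candidate.volume);
  const page = Number(candidate.page);
  return (
    Number.isInteger(volume) && volume > 0 &&
    Number.isInteger(page) && page > 0 &&
    KNOWN_REPORTERS.has(String(candidate.reporter).trim())
  );
}

// Keep only the candidates that pass the check above.
function filterReporterCitations(candidates) {
  return candidates.filter(isPlausibleReporterCitation);
}

// Candidates modeled on the three false positives above, plus one real-looking cite.
const candidates = [
  { volume: "5", reporter: "use", page: "552" },
  { volume: "5", reporter: "GARBAGE", page: "552" },
  { volume: "27", reporter: "and", page: "42" },
  { volume: "410", reporter: "U.S.", page: "113" },
];
const kept = filterReporterCitations(candidates);
console.log(kept);
```

An allowlist trades recall for precision: unusual or misspelled reporters would be dropped, so whether this is acceptable depends on how the results are used.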

konklone commented 6 years ago

This is a great writeup of the problem, thank you! I'll take a look into this, though I don't have an ETA for it. If you're using this in something where time is of the essence, let me know -- and I'd welcome a pull request with a fix, if you have one.

mlissner commented 6 years ago

Weird. Seems like the regex here would only allow USC or U.S.C.:

https://github.com/unitedstates/citation/blob/master/citations/usc.js#L51

Is it being picked up as a U.S.C. citation?
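A quick way to check that reasoning is with a simplified stand-in pattern. The regex below is a hypothetical approximation that accepts only "USC" or "U.S.C." between two numbers; it is not the actual expression at the linked line.

```javascript
// Hypothetical stand-in: volume number, then "USC" or "U.S.C.", then a section number.
const USC_LIKE = /\b(\d+)\s+(?:USC|U\.S\.C\.)\s+(\d+)/;

const samples = [
  "pursuant to 5 use 552(a)(1)(E) and",
  "pursuant to 5 GARBAGE 552(a)(1)(E) and",
  "The sum of 27 and 42 is a number between 68 and 70.",
  "pursuant to 5 USC 552(a)(1)(E) and",
  "pursuant to 5 U.S.C. 552(a)(1)(E) and",
];

// Record whether each sample matches the stand-in pattern.
const results = samples.map((text) => USC_LIKE.test(text));
console.log(results);
```

Under this assumption, none of the three reported strings would match a U.S.C.-style pattern, which suggests the false positives come from a different citation type.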

markmatney commented 6 years ago

@konklone not time sensitive for me. I'm happy to collaborate on this issue though. I wouldn't be able to take it on entirely myself (I'm not too knowledgeable about law and legal citations), but I am quite good with regular expressions. @mlissner each of the examples I mentioned is being interpreted as a citation of type reporter.