sfomuseum / accession-numbers

Machine-readable regular expressions for identifying accession numbers for cultural heritage organizations in text.
Other
6 stars 3 forks source link

Update AIC pattern #17

Closed thisisaaronland closed 2 years ago

thisisaaronland commented 2 years ago

Tests were recently updated to use a "tail-buffer"-based lookup to account for text with multiple accession numbers and arbitrary text (aka a "wall label"):

"This is an object\\nGift of Important Donor\\n302.2021.x1-x2\\n\\nThis is another object\\nAnonymouts Gift\\nPG731.2019 146.2020": 3

This breaks the AIC test for Obj: 96681 (which has been given an expected count of -1 so as not to stop other tests):

> make tests
env GOOS=darwin GOARCH=amd64 go build -o bin/darwin/test-runner cmd/test-runner/main.go
env GOOS=linux GOARCH=amd64 go build -o bin/linux/test-runner cmd/test-runner/main.go
env GOOS=windows GOARCH=amd64 go build -o bin/windows/test-runner cmd/test-runner/main.go
bin/darwin/test-runner data/*.json

2021/11/16 11:15:54 Failed to run tests for Art Institute of Chicago, Failed to find matches for 'Obj: 96681' using '((?:[RXTNObjf0-9: ]+)\/?(?:[0-9\.\-,]*)|(?:\d{4})\.(?:\d+[a-z\-]*))' (Art Institute of Chicago), expected 1 matches but got 2 ([Obj: 96681])
make: *** [tests] Error 1

@nikhiltri Any thoughts on the best approach here?

nikhiltri commented 2 years ago

Hey @thisisaaronland, I'm looking into refactoring the regex to account for this use case again. In the meantime, I wonder about this line that wraps each of the regexes in the tests:

re_pat := fmt.Sprintf(".*?%s", pat)

I worry that the uber-eager .* will find matches in all of the tests whether or not they actually match the given regex pattern. While I'm looking closer into this, I suspect some of my tests should actually fail, but they're all passing.

Perhaps the ^$ patter can be used to lock the match at the end of the string, if that's what you're going for?

thisisaaronland commented 2 years ago

Curiously, adding a trailing $ causes a few more tests (for example 53.87a-i with the Met) to fail.

The thing about the regular expression as it's written is that it assumes the last part of any string being tested will contain an accession number (or not). That is: There won't be any text after a matching accession number.

thisisaaronland commented 2 years ago

Update: The Met tests were failing because of an un-escaped \w. Good times...

thisisaaronland commented 2 years ago

That said, adding ^...$ to the regular expression causes this AIC accession number to fail: RX16751/1844.a, whereas without the start/finish flags it works fine.

I also added a "multi-line" test for AIC which is flagged as -1 (to skip) since it fails with the following error:

2021/11/16 13:21:48 Failed to run tests for Art Institute of Chicago, Failed to find matches for 'This is an object\nGift of Important Donor\nX78\n\nThis is another object\nAnonymouts Gift\nRX16751/1844.a' using '((?:[RXTNObjf0-9: ]+)\/?(?:[0-9\.\-,]*)|(?:\d{4})\.(?:\d+[a-z\-]*))' (Art Institute of Chicago), expected 2 matches but got 12 ([T   bj f f   X78  T   bj   RX16751/1844.])
nikhiltri commented 2 years ago

I think some of the logic is being too generous. It's matching "Obj:" as a complete accession number, then moving on to try to find moar matches. I split out "Obj: ######" to it's own portion of the regex since it's kind of a wonky use case.

nikhiltri commented 2 years ago

Sooooo I ran this regex against every published accession number in our API, and I only found three objects that it doesn't match:

  1. 1900.678 - 685
  2. 1925.2687, 1930.385, 1934.250-252
  3. 0158 (0096TS)

These are admittedly wonky. For the purpose of this project they might not be examples you want to include. But note that all of these have spaces. The logic here that says "if after hitting a space the string doesn't match the regex keep adding onto the string" might not work since the first portion of example one is indeed a whole and complete accession number.

We probably aren't the only institution with IDs like this, so perhaps you could pile up a few more of these examples before refactoring how to deal with spaces in accession numbers.

thisisaaronland commented 2 years ago

Awesome, thank you. I will look at all this in the morning.

This work will yield so many conference talks...

thisisaaronland commented 2 years ago

The "tail-buffer" code has been refactored to read ahead until it fails there is an initial match:

https://github.com/sfomuseum/accession-numbers/blob/refactor-tests/cmd/test-runner/main.go#L151-L177

This coincides with reformatting the tests to include expected positional values. So far I've only done SFO Museum:

https://github.com/sfomuseum/accession-numbers/blob/refactor-tests/data/sfomuseum.json

I will update the existing tests before merging the changes in to main. In principle, at least, these changes should match the AIC accession numbers with white space.

Accession numbers, amirite...

thisisaaronland commented 2 years ago

The refactor-tests branch has been merged in to main. There are still some failing tests for AIC but those have been marked as "to skip" (read: an empty array of expected values) for the time being.

thisisaaronland commented 2 years ago

Fixed as part of #36