sfomuseum / accession-numbers

Machine-readable regular expressions for identifying accession numbers for cultural heritage organizations in text.

Add americanart.si.edu #44

Closed: ericpugh closed 2 years ago

ericpugh commented 2 years ago

Hi, I'm trying to add americanart.si.edu, but I'm having some issues getting the regex to pass the Linux test-runner. Confusingly, the test-runner says it fails on a different accession number each time it's run against the same pattern.

thisisaaronland commented 2 years ago

First: Yay, thanks!

The reason the test-runner reports a different failure each time is that the tests are defined as a dictionary (a hash map). Maps are unordered in Go, so when you range over them the iteration order will rarely be the same from one run to the next.
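To make the ordering point concrete, here is a minimal sketch of one way to get a repeatable order: collect the map keys, sort them, and range over the sorted slice. The `sortedKeys` helper and the shape of the `tests` map are hypothetical, chosen for illustration, and are not taken from the test-runner's source.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of a test map in lexical order, so that
// iteration (and therefore the first reported failure) is repeatable.
// Hypothetical helper, not part of the accession-numbers test-runner.
func sortedKeys(tests map[string][]string) []string {
	keys := make([]string, 0, len(tests))
	for k := range tests {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Assumed test table: input text -> expected matches.
	tests := map[string][]string{
		"XX105":       {"XX105"},
		"1994.18.175": {"1994.18.175"},
		"1985.55":     {"1985.55"},
	}

	// Ranging over tests directly would visit entries in a varying order;
	// ranging over the sorted keys visits them in the same order every run.
	for _, text := range sortedKeys(tests) {
		fmt.Println(text, tests[text])
	}
}
```

With the sorted keys, a failing pattern would stop on the same accession number every run, which makes it easier to tell whether a change to the regex helped.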

It looks like the problem is that some of the inner matches are greedy while the outer match is non-greedy. I tried with the following:

((?:XX\\d+[A-Z]{0,3}\\d?)|(?:\\d+\\.(?:\\d+)\\.?(?:\\d+(?:[A-Z])?(?:-[A-Z])?)?(?:(?:\\.|,)(?:\\d+))?(?:(?:[A-Z-]-[A-Z])+)?))

And it seems to work:

$> bin/darwin/test-runner data/americanart.si.edu.json
2022/01/06 16:23:41 [https://americanart.si.edu/] OK 1994.18.175
2022/01/06 16:23:41 [https://americanart.si.edu/] OK 1985.55
2022/01/06 16:23:41 [https://americanart.si.edu/] OK 1972.167.30A
2022/01/06 16:23:41 [https://americanart.si.edu/] OK XX105
2022/01/06 16:23:41 [https://americanart.si.edu/] OK 1985.66.387,773
2022/01/06 16:23:41 [https://americanart.si.edu/] OK 1967.59.436R-V
2022/01/06 16:23:41 [https://americanart.si.edu/] OK This is an object\nGift of Important Donor\n1994.18.175\n\nThis is another object\nAnonymous Gift\n1980.97A-C 1929.7.167V\nXX98H  \nOil on canvas
2022/01/06 16:23:41 All tests pass for Smithsonian American Art Museum and Renwick Gallery
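For anyone who wants to poke at the pattern outside the test-runner, here is a rough sketch using Go's standard regexp package. It assumes the doubled backslashes above are JSON string escaping, so the raw string below uses single backslashes. The `findAccessionNumbers` helper is made up for this example and is not how the test-runner itself is implemented.

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern is the expression from this thread with the doubled backslashes
// collapsed (assumption: they are JSON string escaping in the data file).
const pattern = `((?:XX\d+[A-Z]{0,3}\d?)|(?:\d+\.(?:\d+)\.?(?:\d+(?:[A-Z])?(?:-[A-Z])?)?(?:(?:\.|,)(?:\d+))?(?:(?:[A-Z-]-[A-Z])+)?))`

// findAccessionNumbers returns every non-overlapping match of pattern in text.
// Hypothetical helper for illustration only.
func findAccessionNumbers(text string) []string {
	re := regexp.MustCompile(pattern)
	return re.FindAllString(text, -1)
}

func main() {
	// The multi-line sample from the test output above.
	text := "This is an object\nGift of Important Donor\n1994.18.175\n\n" +
		"This is another object\nAnonymous Gift\n1980.97A-C 1929.7.167V\nXX98H  \nOil on canvas"

	for _, m := range findAccessionNumbers(text) {
		fmt.Println(m)
	}
}
```

Run against the sample text, this should print 1994.18.175, 1980.97A-C, 1929.7.167V and XX98H, one per line.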

ericpugh commented 2 years ago

@thisisaaronland I've updated. Thanks for your help, that's an ugly-looking regex.

thisisaaronland commented 2 years ago

Thanks!