Open joycebcat opened 8 years ago
We can manage some of this, but shouldn't the data ultimately be cleaned up as well?
My first thought about collapsing École biblique et archéologique française and Ecole biblique et archéologique française is that we'd likely have to just strip all diacritics because I'm not sure how we'd know which is correct.
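To make the "strip all diacritics" idea concrete, here is a minimal sketch using Python's standard `unicodedata` module. The function name `strip_diacritics` is made up for illustration; it is not anything in our indexing code, and it sidesteps the question of which form is correct by only producing a comparison form.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


with_accent = strip_diacritics("École biblique et archéologique française")
without_accent = strip_diacritics("Ecole biblique et archéologique française")
```

Both headings reduce to the same unaccented string, so they could be collapsed for grouping while the stored data stays untouched.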
Yes, data cleanup is desirable. If you can log it, we can clean it. There will be a lot because of things like the rule change mentioned previously.
Voyager handles the difference in display forms by displaying the form from the most recently saved record. Would we be able to choose a form based on some arbitrary rule like that?
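A rule like that could look something like the sketch below. It assumes each variant heading comes with the save date of the record it was taken from; the `(heading, saved_on)` pairs and the dates are invented for illustration and do not reflect how Voyager or our index actually stores this.

```python
from datetime import date


def pick_display_form(variants):
    """Pick the heading from the most recently saved record.

    `variants` is a list of (heading, saved_on) pairs; this shape is an
    assumption made for the example.
    """
    heading, _saved_on = max(variants, key=lambda pair: pair[1])
    return heading


# Invented sample data: two forms of the same heading with made-up dates.
variants = [
    ("Ecole biblique et archéologique française", date(2009, 3, 14)),
    ("École biblique et archéologique française", date(2015, 6, 2)),
]
chosen = pick_display_form(variants)
```

The rule is arbitrary, as noted: it picks a form consistently, but nothing guarantees the most recently saved record carries the correct one.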
They may even be tricky to log; I suppose we could try to come up with an algorithm that looks at the distance between adjacent strings to surface hot spots in the list (we could do this anywhere; it doesn't have to be on the production system, and probably shouldn't be). But even then we wouldn't know which version is correct.
Now that there's an easy-to-browse list, I wonder if this could be a student "sitting job"?
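A rough sketch of that adjacent-string comparison, using `difflib.SequenceMatcher` from the standard library as a stand-in for a real edit-distance measure. The function name `hot_spots`, the 0.9 threshold, and the sample headings are all assumptions for illustration; the threshold would need tuning against the real list.

```python
from difflib import SequenceMatcher


def hot_spots(sorted_headings, threshold=0.9):
    """Return adjacent pairs that differ but are nearly identical."""
    pairs = []
    # Compare each heading only with the one that follows it in the list.
    for current, following in zip(sorted_headings, sorted_headings[1:]):
        if current == following:
            continue
        similarity = SequenceMatcher(None, current, following).ratio()
        if similarity >= threshold:
            pairs.append((current, following))
    return pairs


# Sample list, assumed to be in browse order already.
headings = [
    "Shakespeare, William 1564-1616",
    "Shakespeare, William, 1564-1616",
    "Shaw, George Bernard, 1856-1950",
]
suspects = hot_spots(headings)
```

This only catches variants that sort next to each other, and it flags pairs for review rather than deciding which form to keep.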
@tampakis pointed out that we could write a query that would return records whose sort-normalized version is the same. That might be a way to start.
With Regular Expressions, we could develop a set of patterns to search for on a periodic basis (perhaps every time all records are re-indexed for Blacklight). One I'm looking at right now is "tag='[17][0]{2}'>
@tampakis @jpstroop The data used for the browse lists should be normalized.
shakespeare, William, 1564 1616
Shakespeare, William , 1564-1616
Shakespeare, William 1564-1616
Shakespeare, William, 1564-1616.
These should all be grouped together. Capitalization, punctuation and extra spaces should not generate separate entries. Diacritics should be normalized also. The lines below should be grouped together. (There will be a lot of this sort of thing since diacritics used to be routinely left off capital letters before the rules changed.)
École biblique et archéologique française
Ecole biblique et archéologique française
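As a minimal sketch of the kind of normalization being asked for, the snippet below builds a grouping key that ignores capitalization, punctuation, extra spaces, and diacritics. The names `browse_key` and `group_headings` are hypothetical, and treating every punctuation mark (including the hyphen in date ranges) as a space is an assumption made so that "1564 1616" and "1564-1616" fall together; it is not the existing sort-normalization.

```python
import re
import unicodedata
from collections import defaultdict


def browse_key(heading: str) -> str:
    """Build a grouping key for a browse heading (hypothetical helper)."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", heading)
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Assumption: every punctuation character becomes a space.
    no_punctuation = re.sub(r"[^\w\s]", " ", unaccented)
    # Fold case and collapse runs of whitespace.
    return " ".join(no_punctuation.casefold().split())


def group_headings(headings):
    """Map each key to the list of display forms that share it."""
    groups = defaultdict(list)
    for heading in headings:
        groups[browse_key(heading)].append(heading)
    return dict(groups)


groups = group_headings([
    "shakespeare, William, 1564 1616",
    "Shakespeare, William , 1564-1616",
    "Shakespeare, William 1564-1616",
    "Shakespeare, William, 1564-1616.",
    "École biblique et archéologique française",
    "Ecole biblique et archéologique française",
])
```

With these six inputs the result has two groups, one per heading. Any group with more than one display form is also a candidate for the cleanup log, which is close to the "same sort-normalized version" query suggested earlier in the thread.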