pulibrary / orangeindex

Traject instance for indexing MARC into Solr

normalize data for browse lists #42

Open joycebcat opened 8 years ago

joycebcat commented 8 years ago

@tampakis @jpstroop The data used for the browse lists should be normalized.

shakespeare, William, 1564 1616
Shakespeare, William , 1564-1616
Shakespeare, William 1564-1616
Shakespeare, William, 1564-1616.

These should all be grouped together. Capitalization, punctuation and extra spaces should not generate separate entries. Diacritics should be normalized also. The lines below should be grouped together. (There will be a lot of this sort of thing since diacritics used to be routinely left off capital letters before the rules changed.)

École biblique et archéologique française
Ecole biblique et archéologique française
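A minimal sketch of the grouping rule described above, in Python rather than in the indexer itself. The function name `browse_key` is hypothetical, and stripping every diacritic is an assumption made for simplicity; it only builds a comparison key and says nothing about which form should be displayed.

```python
import re
import unicodedata


def browse_key(heading: str) -> str:
    """Return a grouping key that ignores case, punctuation, spacing, and diacritics."""
    # Decompose accented characters, then drop the combining marks (É -> E).
    decomposed = unicodedata.normalize("NFKD", heading)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case so capitalization does not create separate entries.
    folded = stripped.casefold()
    # Collapse every run of punctuation or whitespace into a single space.
    return re.sub(r"[^\w]+", " ", folded).strip()
```

With this key, the four Shakespeare variants above collapse to one value, and so do the two École/Ecole forms.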

jpstroop commented 8 years ago

We can manage some of this, but shouldn't the data ultimately be cleaned up as well?

My first thought about collapsing École biblique et archéologique française and Ecole biblique et archéologique française is that we'd likely have to just strip all diacritics because I'm not sure how we'd know which is correct.

joycebcat commented 8 years ago

Yes, data cleanup is desirable. If you can log it, we can clean it. There will be a lot because of things like the rule change mentioned previously.

Voyager handles the difference in display forms by displaying the form from the most recently saved record. Would we be able to choose a form based on some arbitrary rule like that?

jpstroop commented 8 years ago

They may even be tricky to log; I suppose we could try to come up with an algorithm that looks at the distance between adjacent strings to surface hot spots in the list (we could do this anywhere, it doesn't have to be on the production system, and probably shouldn't be). But even then we wouldn't know which version is correct.
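One way the adjacent-string idea could be sketched offline, assuming the browse list has been exported as a plain list of headings. The function name, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative choices, not anything decided in this thread.

```python
from difflib import SequenceMatcher


def near_duplicate_neighbors(headings, threshold=0.9):
    """Yield (a, b, ratio) for neighbors in the sorted list that are very similar."""
    # Sort case-insensitively, roughly as a browse list would, after dropping exact duplicates.
    ordered = sorted(set(headings), key=str.casefold)
    for a, b in zip(ordered, ordered[1:]):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if ratio >= threshold:
            yield a, b, ratio
```

This only flags pairs for a person to review. It cannot say which of the two forms is the correct one, and headings that sort far apart (for example because of a leading typo) would not be compared at all.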

Now that there's an easy to browse list, I wonder if this could be a student "sitting job"?

jpstroop commented 8 years ago

@tampakis pointed out that we could write a query that would return records whose sort-normalized version is the same. That might be a way to start.
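The same idea can be tried outside the index by grouping headings on their normalized key and reporting only the keys that have more than one display form. This is a sketch, not the query itself: `colliding_forms` and `simple_key` are hypothetical names, and `simple_key` is a deliberately crude stand-in for whatever sort normalization the index actually applies.

```python
from collections import defaultdict


def colliding_forms(headings, key):
    """Map each normalized key to its distinct display forms, keeping only keys with several forms."""
    groups = defaultdict(set)
    for heading in headings:
        groups[key(heading)].add(heading)
    return {k: sorted(forms) for k, forms in groups.items() if len(forms) > 1}


def simple_key(heading):
    # Stand-in normalization (an assumption): fold case and collapse whitespace only.
    return " ".join(heading.casefold().split())
```

Each entry in the result is a candidate for cleanup, since every form listed under one key would sort identically but display differently.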

mzelesky commented 8 years ago

With regular expressions, we could develop a set of patterns to search for on a periodic basis (perhaps every time all records are re-indexed for Blacklight). One I'm looking at right now is "tag='[17][0]{2}'>[a-z]{3,},", which will find any 100 or 700 field whose first subfield is 'a' and whose subfield a begins with 3 or more lowercase letters followed by a comma. In the entire database, I found 181 records that matched that pattern.
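A small sketch of how such a pattern list might be run in a script. The pattern is copied verbatim from the comment above; the sample lines are invented, since the thread does not show the record serialization being searched, and the names `PATTERNS` and `flag_lines` are hypothetical.

```python
import re

# Periodic cleanup checks; more patterns could be added to this dictionary.
PATTERNS = {
    "lowercase start in 100/700": re.compile(r"tag='[17][0]{2}'>[a-z]{3,},"),
}


def flag_lines(lines):
    """Return (check name, line) pairs for every line that matches a cleanup pattern."""
    hits = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((name, line))
    return hits


# Invented sample input; the real record format is an assumption here.
sample = [
    "<field tag='100'>shakespeare, William, 1564-1616",
    "<field tag='100'>Shakespeare, William, 1564-1616",
    "<field tag='700'>austen, Jane, 1775-1817",
    "<field tag='245'>hamlet, prince of Denmark",
]
flagged = flag_lines(sample)
```

In this pattern `[0]{2}` is equivalent to writing `00`, so the tag part matches exactly `100` or `700`.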