Group by collection header breaks when EAD top-level isn't a collection

seanaery commented 5 years ago

Example in ArcLight Demo: https://arclight-demo.projectblacklight.org/?utf8=%E2%9C%93&group=true&search_field=all_fields&q=wagaw

In this case, the EAD file has `<archdesc level="recordgrp">`

mejackreed commented 5 years ago

@seanaery are we tracking this concept of the top level not being a "collection" in any other tickets?

While I agree it's not in line with the EAD specification, a lot of the ArcLight design depends on this construct. So I'm curious whether we have done any design work to identify what other changes would be necessary to accommodate this.

seanaery commented 5 years ago

@mejackreed Great question. Besides this ticket, we do have a couple of others logged now (#776 and #778), but there might be many more lurking. I'm getting a sense that there's a lot of logic built into ArcLight right now that assumes the top level is "collection", and pulling on the thread of that assumption might unravel quite a bit or introduce some major new complexities.

I'd like to explore this indexing approach and see what happens, since it could fix all these issues with a minimal amount of work:

- Add "collection" to all the top-level level_ssm field no matter what the actual @level is. Keep the original real @level value too for faceting. Maybe have to do something similar with level_ssi.
- See how these changes might relate to #493 where the level is used more for display
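
A minimal sketch of this indexing idea, in Python for illustration only (ArcLight itself is a Ruby application, and this is not its indexing code). The EAD fragment and the `top_level_fields` helper are hypothetical; only the `level_ssm` field name comes from the discussion above. `level_ssi` is left out because how to handle it is still an open question.

```python
import xml.etree.ElementTree as ET

# Hypothetical EAD fragment whose top level is a record group, not a collection.
SAMPLE_EAD = """<ead>
  <archdesc level="recordgrp">
    <did><unittitle>Example record group</unittitle></did>
    <dsc>
      <c01 level="series"><did><unittitle>Series 1</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>"""


def top_level_fields(ead_xml):
    """Build the level field for the top-level (<archdesc>) document."""
    archdesc = ET.fromstring(ead_xml).find("archdesc")
    # Assumption: a missing @level is treated as "collection".
    actual_level = archdesc.get("level", "collection")

    # Always index "collection" so collection-based logic keeps working,
    # and keep the real @level next to it so it can still be faceted on.
    levels = ["collection"]
    if actual_level != "collection":
        levels.append(actual_level)
    return {"level_ssm": levels}


fields = top_level_fields(SAMPLE_EAD)
print(fields)
```

With this approach, anything in the app that checks whether `level_ssm` includes "collection" would keep matching top-level documents, while the original value stays available.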

anarchivist commented 5 years ago

@mejackreed See #776 and #778, for example.

These are cases where the top-level (`<archdesc/>`) level attribute is set to recordgrp ("record group"). The level attribute is an enumerated list of values, and nothing in its semantics restricts which values may appear at the top level vs. in component descriptions. By far the most common top-level value for this attribute is collection, but it's true that any of these levels is possible. See, e.g., Bron, Proffitt, and Washburn (2013):

Table 6: (Wisser Table 9): Values for level within archdesc

_See [Counting Element Occurrences](https://journal.code4lib.org/articles/8956#docs-internal-guid-47977e47-31c1-523d-0d8a-75e91e7d60d8) for the key_

| Element | N | N_uniq | % N_uniq/S | % [N_uniq/n=124009] | % [(N_uniqK)/n=1,136] | diff |
| --- | --- | --- | --- | --- | --- | --- |
| collection | 116957 | 116957 | 94.31 | 94.31 | 90.90 | 3.41 |
| fonds | 135 | 135 | 0.11 | 0.11 | 4.80 | -4.69 |
| class | 9 | 9 | 0.01 | 0.01 | 0.30 | -0.29 |
| recordgrp | 433 | 433 | 0.35 | 0.35 | 1.40 | -1.05 |
| series | 2394 | 2394 | 1.93 | 1.93 | 0.60 | 1.33 |
| subfonds | 49 | 49 | 0.04 | 0.04 | 0.30 | -0.26 |
| subgrp | 526 | 526 | 0.42 | 0.42 | 1.00 | -0.58 |
| subseries | 46 | 46 | 0.04 | 0.04 | 0.00 | 0.04 |
| file | 2446 | 2446 | 1.97 | 1.97 | 0.40 | 1.57 |
| item | 987 | 987 | 0.80 | 0.80 | 0.30 | 0.50 |
| otherlevel | 25 | 25 | 0.02 | 0.02 | 0.10 | -0.08 |

mejackreed commented 5 years ago

Sounds good 👍 .

anarchivist commented 5 years ago

@seanaery

> Add "collection" to all the top-level level_ssm field no matter what the actual @level is. Keep the original real @level value too for faceting. Maybe have to do something similar with level_ssi.

Per #778 I'd be in favor of changing the logic to have a different way to specify what's the top-level. Does the change in our indexing strategy allow us to do that easily? For example, can we identify what is/is not a child document?

seanaery commented 5 years ago

Thanks for this feedback, @anarchivist; those stats and that C4L article are really helpful. It's probably possible to do what you suggest. It would be easy to add a field in the indexing process that clearly distinguishes top-level vs. component. Though I think we need to investigate a bit more to get a clearer sense of what it'd take to replace all the logic in the app that currently hinges on level with logic that uses a different field instead. It could be fairly complicated, but I'm not sure.
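
As a rough illustration of such a field, here is a sketch in Python (not ArcLight's Ruby indexing code). The `is_top_level` field name, the sample EAD, and the helper functions are all hypothetical. The point is only that the flag is set from the document's position in the EAD tree, so grouping logic no longer needs to look at the level value.

```python
import xml.etree.ElementTree as ET

# Hypothetical EAD fragment with a record group at the top level.
SAMPLE_EAD = """<ead>
  <archdesc level="recordgrp">
    <did><unittitle>Example record group</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Series 1</unittitle></did>
        <c02 level="file"><did><unittitle>File 1</unittitle></did></c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def is_component(tag):
    """True for EAD component tags: <c> or numbered ones such as <c01>."""
    return tag == "c" or (tag.startswith("c") and tag[1:].isdigit())


def build_documents(ead_xml):
    """Return one dict per description, flagged as top-level or component."""
    archdesc = ET.fromstring(ead_xml).find("archdesc")
    docs = [{
        "title": archdesc.findtext("did/unittitle"),
        "level": archdesc.get("level"),
        "is_top_level": True,  # hypothetical field, set by position only
    }]
    for element in archdesc.iter():
        if is_component(element.tag):
            docs.append({
                "title": element.findtext("did/unittitle"),
                "level": element.get("level"),
                "is_top_level": False,
            })
    return docs


def group_headers(docs):
    """Pick group headers by the flag instead of by level == 'collection'."""
    return [doc["title"] for doc in docs if doc["is_top_level"]]


docs = build_documents(SAMPLE_EAD)
print(group_headers(docs))
```

The harder part, as noted above, is finding every place in the app that currently checks the level and switching it to the new field.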

One design challenge, regardless of whether we 1) index a top-level recordgrp or fonds, etc. as if it were a collection; 2) replace any level-based logic to use a new top-level / not field; 3) do something else...

Is the label "Collection" still appropriate in the UI even if it stretches the definition beyond the archival definition of the term? E.g., would the "Group by Collection" button still use the term "Collection"? Would the "Collections" link in the primary nav "Repositories | Collections" still say "Collections"?

anarchivist commented 5 years ago

> Is the label "Collection" still appropriate in the UI even if it stretches the definition beyond the archival definition of the term? E.g., would the "Group by Collection" button still use the term "Collection"? Would the "Collections" link in the primary nav "Repositories | Collections" still say "Collections"?

I think that's probably fine. My guess is that localization will allow the replacement of "Collections" with another word rather easily. If not (e.g., if this relates somehow to the logic changes), I'm confident that we can either reuse "Collections" as a string or find a reasonable replacement.
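
To make the localization point concrete, here is a small sketch of label overrides in Python. It is not how ArcLight's localization actually works; the keys, the `label` helper, and the replacement wording are hypothetical. Only the default strings come from the UI labels mentioned above.

```python
# Default UI strings quoted in this thread; the keys are hypothetical.
DEFAULT_LABELS = {
    "group_by": "Group by Collection",
    "nav": "Collections",
}


def label(key, overrides=None):
    """Return a site-specific label if one is configured, else the default."""
    return (overrides or {}).get(key, DEFAULT_LABELS[key])


# A site whose top-level descriptions are mostly record groups might
# override only the navigation label and keep the other default.
site_overrides = {"nav": "Record Groups"}
print(label("nav", site_overrides))
print(label("group_by", site_overrides))
```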

projectblacklight / arclight

Group by collection header breaks when EAD top-level isn't a collection #827