ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License

[bug?] Nuxeo component object ordering is not retained in Calisphere complex object viewer #985

Closed by christinklez 1 month ago

christinklez commented 5 months ago

Expect to see the Nuxeo component object sort order being retained in the Calisphere item viewer.

Nuxeo fetcher code: https://github.com/ucldc/rikolti/blob/main/metadata_fetcher/fetchers/nuxeo_fetcher.py#L91

Nuxeo

Nuxeo object: https://nuxeo.cdlib.org/nuxeo/nxdoc/default/06e49f86-fd49-43d6-82b9-14f07094eada/view_documents

Expected ordering of component objects:

Calisphere-stage

Calisphere-stage object: https://calisphere-stage.cdlib.org/item/06e49f86-fd49-43d6-82b9-14f07094eada/?order=0

Order of records in the vernacular metadata:

christinklez commented 5 months ago

I recreated 3 UCB EDA objects, all of which are complex objects that are not displaying in the intended Nuxeo order. (Stage has a different order.)

Experiment 1: Blake Estate: Long Range Development Plan - Manually uploaded each component in the order presented in Nuxeo - did not reorder

Experiment 2: Cole, Fraser - manual upload (in the order the components are listed on stage), then reordered (to match the order in Nuxeo)

Initial upload sequence: Upload order to match how it appears on -stage. Image

"Fixed" (to match original Nuxeo record) sequence: Reordered to match the Nuxeo record. Image

Experiment 3: Eckbo, Garrett (ALCOA Forecast) - uploaded via the File Uploader, then renamed, then reordered in the UI (to match Nuxeo's order)

Note that my hunch is UCB EDA created these objects by uploading via the File Uploader client. I'm pretty certain of this given that the component objects' filepaths all have the original filenames (set by the very initial document title).

Initial upload sequence: This is how the File Uploader uploaded. Image

Renamed the component objects (still in uploaded sequence): Image

"Fixed" (to match original Nuxeo record) sequence: Image

barbarahui commented 5 months ago

@christinklez I tried fetching the metadata for your recreated test Eckbo object, and the children are in order!!

@amywieliczka noticed that for problematic objects, the order of the child components in the left sidebar in nuxeo doesn't match the order on the main page for the parent object, by the way. As we know, there are issues with ordering in Nuxeo...I'm not sure what method in particular is causing the problem.

barbarahui commented 5 months ago

It turns out that the @search nuxeo API endpoint returns component objects in the wrong order sometimes. The search/lang/NXQL/execute endpoint described here returns them in the correct order: https://doc.nuxeo.com/nxdoc/search-endpoints/#searching-by-query

This PR updates the API endpoint for the query that gets the child component objects: https://github.com/ucldc/rikolti/pull/994
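A minimal sketch of what a request for child components against the `search/lang/NXQL/execute` endpoint might look like. It only builds the URL and sends nothing. The function name, the paging defaults, and the exact query text are assumptions for illustration (the query mirrors the one quoted later in this thread); the PR itself may be structured differently.

```python
from urllib.parse import urlencode

# Base URL as quoted elsewhere in this thread.
NUXEO_API = "https://nuxeo.cdlib.org/Nuxeo/site/api/v1"


def build_children_request(parent_uid, page_index=0, page_size=100):
    """Build the URL for one page of a parent's direct children, ordered by ecm:pos.

    Hypothetical helper: the name and defaults are not taken from rikolti.
    """
    query = (
        "SELECT * FROM Document "
        f"WHERE ecm:parentId = '{parent_uid}' "
        "AND ecm:isVersion = 0 "
        "AND ecm:mixinType != 'HiddenInNavigation' "
        "AND ecm:isTrashed = 0 "
        "ORDER BY ecm:pos ASC"
    )
    params = {
        "query": query,
        "currentPageIndex": page_index,  # zero-based page number
        "pageSize": page_size,
    }
    return f"{NUXEO_API}/search/lang/NXQL/execute?{urlencode(params)}"


if __name__ == "__main__":
    print(build_children_request("06e49f86-fd49-43d6-82b9-14f07094eada"))
```

A real fetcher would also need authentication headers and a loop over pages until the response reports no further results.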

barbarahui commented 5 months ago

Once it's deployed, we'll need to reharvest Nuxeo collections with the component ordering issue. @christinklez I don't suppose you have an exhaustive list of which collections are affected? I'm sure it's really hard to figure out just by looking at the website. I could write a script to compare the ordering for all collections...
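A comparison script along those lines could be sketched as follows, assuming the child UIDs for each parent have already been collected from Nuxeo and from the harvested records into two dictionaries. The function name and the input shape are hypothetical.

```python
def find_order_mismatches(nuxeo_order, harvested_order):
    """Return {parent_uid: (expected, actual)} for parents whose child order differs.

    Both arguments map a parent UID to a list of child UIDs in display order.
    Parents absent from the harvested data are skipped rather than reported.
    """
    mismatches = {}
    for parent_uid, expected in nuxeo_order.items():
        actual = harvested_order.get(parent_uid)
        if actual is not None and actual != expected:
            mismatches[parent_uid] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    nuxeo = {"a": ["1", "2", "3"], "b": ["x", "y"]}
    stage = {"a": ["1", "3", "2"], "b": ["x", "y"]}
    for uid, (expected, actual) in find_order_mismatches(nuxeo, stage).items():
        print(uid, "expected", expected, "got", actual)
```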

barbarahui commented 5 months ago

OpenSearch query for Nuxeo collections with complex objects:

GET rikolti-stg/_search
{
  "query": {
    "nested": {
      "path": "children",
      "query": {
        "match_all": {}
      }
    }
  },
  "aggs": {
    "collection_ids": {
      "terms": {
        "field": "collection_url",
        "size": 10000
      }
    }
  },
  "size": 0
}

christinklez commented 5 months ago

28197 reharvest results: The ordering is significantly better! But there were two objects that were not in the correct sequence.

Expect (according to Nuxeo):

Image

Actual (on -stage):

Items 7 & 8 ordering (on -stage) is swapped: https://calisphere-stage.cdlib.org/item/06e49f86-fd49-43d6-82b9-14f07094eada/?order=6

Nuxeo again

I went into the Nuxeo record and clicked "Edit" and then "Save" (without making any changes). I also clicked the "refresh" icon on the upper right of the component object listing. The display order of the component objects then updated. This order matches what is coming through on -stage (as well as the vernacular fetched metadata). Image

christinklez commented 5 months ago

28199 reharvest results:

Cole, Fraser record sequence matches Nuxeo: https://calisphere-stage.cdlib.org/item/cf56d5e9-b688-42f7-981e-32c551783183/

Eckbo, Garrett (ALCOA Forecast) record sequence matches Nuxeo: https://calisphere-stage.cdlib.org/item/17eef353-3469-45a3-a64d-973e982dda87/

christinklez commented 5 months ago

Just Zoomed with @barbarahui about this, but we are witnessing different component object sequencing in the Nuxeo UI depending on the Nuxeo record URL...

(PS: My "edit" and "save" attempt may be a red herring. There may be something to "how you get to a record" that may affect the display order?)

christinklez commented 5 months ago

I sample QA'd three more UCB EDA collections:

Each of these collections has at least one record with component objects that do not match the Nuxeo sequence. I reharvested all three collections (with the updated Nuxeo endpoint) and the Nuxeo sequence now matches the -stage sequence.

These are screen captures from before the reharvest (i.e., before the Nuxeo endpoint update), which show that the Nuxeo sequence did not match the -stage sequence:

Farrand: https://calisphere-stage.cdlib.org/item/56402aab-45a6-4f64-8a48-56c4cb6d6a3c/ Image

Royston: https://calisphere-stage.cdlib.org/item/ec8f5fbd-b518-493a-b7e7-89c7ea732db6/ Image

Church: https://calisphere-stage.cdlib.org/item/2f1664a4-d05b-4f66-bb9d-8e575206bab9/ Image

barbarahui commented 5 months ago

I filed a Nuxeo JIRA ticket for this: https://jira.nuxeo.com/browse/SUPNXP-51504

barbarahui commented 5 months ago

JSON report containing information on the 187 items (across 25 collections) where we think the order of the complex object components on calisphere-stage doesn't match what is in the Nuxeo prod UI: https://drive.google.com/drive/folders/1IMknt9UxHxZRduDJ6ILrIJUCh2H5fKwz

barbarahui commented 5 months ago

Collection ID  Count of complex objs with ordering problem
27809          1
27124          13
26883          45
26466          2
26887          1
26771          5
65             5
26864          1
27569          16
26713          54
26147          1
27598          16
28203          2
26677          1
26945          5
1324           1
26895          4
27594          2
28040          3
15586          3
18409          1
25889          2
28197          1
472            1
4908           1

barbarahui commented 5 months ago

I did some more digging, and not all of the problematic objects are missing pos (position) in the database. A couple of different examples I found when spot checking:

1987_0222_UCI_Choir : California and UCI Chamber Singers

Nuxeo: https://nuxeo.cdlib.org/nuxeo/nxdoc/default/8610b0c8-f3f7-4978-b264-f10b5dddd112/view_documents
Calisphere stage: https://calisphere-stage.cdlib.org/item/8610b0c8-f3f7-4978-b264-f10b5dddd112

The issue with this object is that there are grandchildren nested inside a couple of the children. Child 1987_0222_UCI_Choir_1of2 has an object nested inside of it. Child 1987_0222_UCI_Choir_2of2 does as well. The old fetcher was picking up these nested grandchildren and so they are displayed in calisphere mixed in with the children. Since I modified the fetcher to fix https://github.com/ucldc/rikolti/issues/1000 it will no longer fetch grandchildren, only direct children of the parent object. Is this correct?
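To illustrate the difference between picking up every descendant and picking up only direct children, here is a small sketch over made-up records. It assumes the old behavior amounts to matching on a path prefix and the new behavior to matching on the parent ID; the paths, UIDs, and function names are invented for the example.

```python
# Invented records: one parent, two children, one grandchild under each child.
DOCS = [
    {"uid": "parent", "parent_id": None, "path": "/asset-library/demo/choir"},
    {"uid": "child-1", "parent_id": "parent", "path": "/asset-library/demo/choir/1of2"},
    {"uid": "grandchild-1", "parent_id": "child-1", "path": "/asset-library/demo/choir/1of2/track"},
    {"uid": "child-2", "parent_id": "parent", "path": "/asset-library/demo/choir/2of2"},
    {"uid": "grandchild-2", "parent_id": "child-2", "path": "/asset-library/demo/choir/2of2/track"},
]


def by_path_prefix(docs, parent_path):
    """Everything below parent_path, at any depth (children and grandchildren)."""
    prefix = parent_path.rstrip("/") + "/"
    return [d["uid"] for d in docs if d["path"].startswith(prefix)]


def by_parent_id(docs, parent_uid):
    """Only documents whose immediate parent is parent_uid."""
    return [d["uid"] for d in docs if d["parent_id"] == parent_uid]


if __name__ == "__main__":
    print(by_path_prefix(DOCS, "/asset-library/demo/choir"))
    print(by_parent_id(DOCS, "parent"))
```

The prefix match returns the grandchildren interleaved with the children, which is the mixing described above; the parent ID match returns the two children only.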

A system of ethicks. /By the Reverend Mr. Henry Grove. [Vol. 2]

Nuxeo: https://nuxeo.cdlib.org/nuxeo/nxdoc/default/097c0984-4e1f-43b5-b39b-101558ae3921/view_documents
Calisphere stage: https://calisphere-stage.cdlib.org/item/ark:/21198/n1j60r/

Pages 240 and 241 are reversed in Nuxeo:

Image

They are not reversed on calisphere-stage. These objects have pos values in the database, so I don't understand why the fetcher is getting them in a different order from what's in Nuxeo.

[UPDATE] It looks like this particular query and endpoint (which the fetcher was using before we made a couple of changes) returns page 240 and then 241:

Query:

Select * from document where ecm:path startswith '/asset-library/UCLA/clark/mss/09/uclaclark_ms1976010-2_Tiffs' AND ecm:isVersion = 0 AND ecm:mixinType != 'HiddenInNavigation' AND ecm:isTrashed = 0 ORDER BY ecm:pos ASC

Endpoint:

https://nuxeo.cdlib.org/Nuxeo/site/api/v1/path/@search

However, the query and endpoint we are currently using returns page 241 and then 240 (matching what is in Nuxeo):

Query:

Select * from document where ecm:parentId =  '097c0984-4e1f-43b5-b39b-101558ae3921' AND ecm:isVersion = 0 AND ecm:mixinType != 'HiddenInNavigation' AND ecm:isTrashed = 0 ORDER BY ecm:pos ASC

Endpoint:

https://nuxeo.cdlib.org/Nuxeo/site/api/v1/search/lang/NXQL/execute

My head is going to explode 🤯

[MORE UPDATE] The 2 children that are coming back in an inconsistent order have the same pos value of 430 in both the database and elasticsearch. This definitely seems like a bug. I'll file it with Nuxeo.

nuxeo-> WHERE parentid = '097c0984-4e1f-43b5-b39b-101558ae3921' and pos = 430;
                  id                  |               parentid               | pos |              name              | isproperty |     primarytype     | istrashed
--------------------------------------+--------------------------------------+-----+--------------------------------+------------+---------------------+-----------
 7b0233ea-37c5-44f5-8df0-294ed633f7e3 | 097c0984-4e1f-43b5-b39b-101558ae3921 | 430 | uclaclark_ms1976010-2_0246.tif | f          | SampleCustomPicture |
 320d3025-c899-4ea2-83af-7023d0c62f2e | 097c0984-4e1f-43b5-b39b-101558ae3921 | 430 | uclaclark_ms1976010-2_0247.tif | f          | SampleCustomPicture |
(2 rows)

I filed this as a separate JIRA issue: https://jira.nuxeo.com/browse/SUPNXP-51574
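When two siblings share a pos value, a sort on pos alone has no basis for choosing between them, which would be consistent with the two endpoints returning them in different orders. A minimal sketch for spotting such ties, assuming the rows for one parent are available as dictionaries with `id` and `pos` keys (the function name and row shape are hypothetical):

```python
from collections import defaultdict


def find_duplicate_positions(rows):
    """Return {pos: [child ids]} for every pos value used by more than one child.

    Rows with pos of None are ignored here; missing positions are a separate problem.
    """
    by_pos = defaultdict(list)
    for row in rows:
        if row["pos"] is not None:
            by_pos[row["pos"]].append(row["id"])
    return {pos: ids for pos, ids in by_pos.items() if len(ids) > 1}


if __name__ == "__main__":
    rows = [
        {"id": "7b0233ea-37c5-44f5-8df0-294ed633f7e3", "pos": 430},
        {"id": "320d3025-c899-4ea2-83af-7023d0c62f2e", "pos": 430},
    ]
    print(find_duplicate_positions(rows))
```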

barbarahui commented 5 months ago

I did some testing to see if I could fill in pos values for objects missing them via the UI. I was able to get the values populated, but not in the order that I would have expected. It might be easier to demonstrate on zoom, but here are the steps I followed:

  1. Copy and paste a problematic object into my personal folder on nuxeo (UCOP/barbaratest)
  2. Select the first component object by clicking the box on the left
  3. Click the "move up" button that appears below the component objects
  4. Logout of nuxeo and log back in
  5. The component objects in the main page and in the left sidebar now both have the same order, and the order remains consistent. The database also now has pos filled in. The fetcher now retrieves the objects in this order consistently.

HOWEVER, the order is nothing like what was displayed in the UI before this. So I think that the best we can do to fix the objects with missing pos is to programmatically order them alphabetically by title or by filename. Then users will have to manually reorder them if this is incorrect.

This would just be a fix for existing records that are missing pos. We still need to replicate the workflow that results in complex object components being created without pos and fix it so that we don't get new objects with this problem going forward.
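The alphabetical fallback could be sketched roughly as below. This only computes the new positions in memory; writing them back to Nuxeo is out of scope. The function name, the record shape, the case-insensitive comparison, and numbering from 0 are all assumptions for illustration.

```python
def assign_positions(children, key="title"):
    """Return copies of the child records with sequential pos values.

    Children are sorted case-insensitively by `key` ("title" or "filename"),
    with the uid as a tie-breaker so the result is deterministic.
    """
    ordered = sorted(children, key=lambda c: (c[key].lower(), c["uid"]))
    return [{**child, "pos": index} for index, child in enumerate(ordered)]


if __name__ == "__main__":
    children = [
        {"uid": "u2", "title": "page_002.tif", "pos": None},
        {"uid": "u1", "title": "page_001.tif", "pos": None},
    ]
    for child in assign_positions(children):
        print(child["pos"], child["title"])
```

Note that plain string sorting puts "page_10" before "page_2"; filenames without zero padding would need a natural-sort key instead.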

barbarahui commented 5 months ago

The secondary issue we discovered, where more than one child object has the same pos value, is thankfully not that widespread. Here's the info on the 4 complex objects with the problem:

097c0984-4e1f-43b5-b39b-101558ae3921

A system of ethicks. /By the Reverend Mr. Henry Grove. [Vol. 2]
https://calisphere-stage.cdlib.org/item/ark:/21198/n1j60r/
https://nuxeo.cdlib.org/nuxeo/nxdoc/default/097c0984-4e1f-43b5-b39b-101558ae3921/view_documents
Collection 26887
329 components total
2 components have pos 430

63edf544-4746-4211-adc7-1bf05edec202

https://nuxeo.cdlib.org/nuxeo/nxdoc/default/63edf544-4746-4211-adc7-1bf05edec202/view_documents
https://calisphere-stage.cdlib.org/item/ark:/87280/t0np22c5/
Collection 26147
2 components total
2 components have pos = 0

f026c534-b844-46a3-ab70-e3637cf71e12

https://nuxeo.cdlib.org/nuxeo/nxdoc/default/f026c534-b844-46a3-ab70-e3637cf71e12/view_documents
Not on calisphere
160 components total
2 components have pos 54
2 components have pos 98

0e45044b-c08a-45ef-bf38-76f065878dd5

https://nuxeo.cdlib.org/nuxeo/nxdoc/default/0e45044b-c08a-45ef-bf38-76f065878dd5/view_documents
Not on calisphere
706 components total
305 pairs: pos 1-305 are doubled

barbarahui commented 5 months ago

A list of the 1227 complex objects (by path) in Nuxeo whose children have no pos in the database:

complex_obj_null_pos_paths.txt

This includes objects that aren't published to Calisphere.

barbarahui commented 5 months ago

A file with more complete info (uid, path, title) on the 1227 complex objects whose children have no pos in the db:

parent_obj_no_pos.json

aturner commented 4 months ago

@barbarahui sample object #13, with components that have metadata from nuxeo_spreadsheet import (following UCB EDA's method):

/asset-library/UCOP/aturner/orderingtest/Example 12.5962533403185838
https://nuxeo.cdlib.org/nuxeo/nxdoc/default/db065b59-8ef8-44e1-a902-1f1e58e988ad/view_documents

aturner commented 4 months ago

Summary of next steps:

barbarahui commented 4 months ago

Updated list of paths for complex objects that have the null pos problem. Objects with only one component have been filtered out.

complex_obj_null_pos_paths.txt
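The filtering described here can be sketched as follows, assuming the database rows are available as dictionaries with `parentid` and `pos` keys (matching the column names shown earlier in this thread). The function name is hypothetical, and the report was not necessarily produced this way.

```python
from collections import defaultdict


def parents_missing_pos(rows):
    """Return sorted parent ids that have more than one child and at least one child without pos."""
    positions_by_parent = defaultdict(list)
    for row in rows:
        positions_by_parent[row["parentid"]].append(row["pos"])
    return sorted(
        parent
        for parent, positions in positions_by_parent.items()
        if len(positions) > 1 and any(pos is None for pos in positions)
    )


if __name__ == "__main__":
    rows = [
        {"parentid": "p1", "pos": None},
        {"parentid": "p1", "pos": None},
        {"parentid": "p2", "pos": None},  # single component, filtered out
    ]
    print(parents_missing_pos(rows))
```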

christinklez commented 4 months ago

Tracking emails to campuses:

Spreadsheet with list of parent documents: https://docs.google.com/spreadsheets/d/1Pej1YCP6tB8nERkZx3vX1CbbpzaADivV2GQKE7hBvyw/edit?gid=0#gid=0

christinklez commented 3 months ago

@barbarahui @aturner -- an update that we've received confirmation from all campus units that it's okay to run the positioning script for all Nuxeo complex objects currently missing position numbers.

@aturner -- the spreadsheet also includes indicators on which collections need to be reharvested. We can touch base on that later, after the positioning script has been run.

Thanks!!

christinklez commented 3 months ago

@barbarahui -- before running the positioning script, would you be able to do one more run to identify documents that have component objects with missing positions?

Jason at UCB EDA went ahead and touched all of the objects, is happy with the current position order of their collections, and provided approval to publish these to Calisphere. I let him know that we'll double-check these collections to see if he missed any. Since he's happy with the current component object ordering, he doesn't want them to be arranged by filename (if they don't have positions) and would prefer to use the "move" functions to trigger the numbering.

Thank you!

cc: @aturner

https://help.oac.cdlib.org/a/tickets/142259 (reminder to self, to send updates to Jason through this thread)

barbarahui commented 2 months ago

@christinklez I'm attaching a list of all of the documents with missing positions. They're ordered alphabetically, so UCB's are at the top.

complex_obj_no_order_paths_2024-09-16T18:21:24.PDT.txt

christinklez commented 2 months ago

New tab in this spreadsheet: https://docs.google.com/spreadsheets/d/1Pej1YCP6tB8nERkZx3vX1CbbpzaADivV2GQKE7hBvyw/edit?gid=1496331430#gid=1496331430 -- for UCB EDA documents only.

There were 12 objects (from their newest, most recently harvested/published collections) that would be impacted by the positioning script. I've messaged Jason about those 12 objects. https://help.oac.cdlib.org/a/tickets/142259

christinklez commented 2 months ago

Got the okay from UCB EDA to go ahead and run the position number script on their objects as well! Please feel free to run the position numbering script. Thank you!

barbarahui commented 2 months ago

I ran the script to assign an order value to all of the complex object components in Nuxeo that were missing them.

Summary: Updated 36459 children of 1196 objects

Note: the number of objects is higher than what was in complex_obj_no_order_paths_2024-09-16T18:21:24.PDT.txt because that report only lists parents with more than one component.

PR: https://github.com/ucldc/nuxeo-component-ordering/pull/1

I'm attaching a json file containing data on the updates, just in case we need to refer back to it at any point: null_order_fix_report_2024-09-27T16_21_02.PDT.json