Expert's documents can get ridiculously large

qjhart commented 2 months ago

Example: https://experts.ucdavis.edu/expert/48xkGvFK

The API file for this expert is 73 MB. On my speedy machine it takes ~10 s to load, which is the bulk of the page's load time.

There are a number of issues at play here. First, a considerable amount of this data is the thousands of additional authors that exist for each citation. Originally we had an additional modification to experts cdl where we would stop authors at 40 (but add the last author). I'm a little conflicted about this: another user (for example, the author) might want to see all the authors for some specific reason.
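A minimal sketch of the author cutoff described above, for discussion only. The helper name `trimAuthors`, the `omitted` count, and reading "stop at 40" as "first 40 plus the last author" are all assumptions, not the actual experts cdl code.

```javascript
// Hypothetical helper: keep the first `limit` authors and always keep the last one.
// Returns the trimmed list plus how many authors were left out, so a client
// could still show something like "and 959 more".
function trimAuthors(authors, limit = 40) {
  const list = Array.isArray(authors) ? authors : [];
  // Nothing to trim if the list already fits (first `limit` + last author).
  if (list.length <= limit + 1) {
    return { authors: list, omitted: 0 };
  }
  const kept = list.slice(0, limit);
  kept.push(list[list.length - 1]);
  return { authors: kept, omitted: list.length - kept.length };
}

// Example with a citation that has 1000 authors.
const manyAuthors = Array.from({ length: 1000 }, (_, i) => ({ rank: i + 1 }));
const trimmed = trimAuthors(manyAuthors);
console.log(trimmed.authors.length, trimmed.omitted); // 41 959
```

Returning the omitted count keeps the option open to fetch the full author list on demand for the users who want it.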

Another issue is that in most circumstances we are looking at only a small part of the expert document.

If we followed the idea from Fedora, we could add some additional representations to our Prefer header and place some additional limits on these components. We could limit the page and count, and we could even trim authors from our display.

Proposed API Updates

In the /api/expert/<id> GET route, we can add the following:
- filtering of grants and works to return 5 grants and 10 works
- the grants and works would need to be sorted by date (end date for grants, issued date for works) descending, then by title/name ascending
- we could add a ?full param to the endpoint to return the entire document with all grants/works for the expert
- also return counts for the number of works and grants, and the number of hidden works and grants
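The list above could be sketched roughly as follows. This is an illustration under assumptions: the document shape (`doc.grants`, `doc.works`), the field names `date` and `title`, and the function names are hypothetical, and dates are assumed to be ISO-8601 strings so that string comparison matches chronological order.

```javascript
// Sort newest first; ties on date fall back to title ascending.
// Items without a date sort last.
function byDateDescThenTitle(a, b) {
  const da = a.date || '';
  const db = b.date || '';
  if (da !== db) return da < db ? 1 : -1;
  return (a.title || '').localeCompare(b.title || '');
}

// Return the first `limit` items in display order plus the overall count.
function topItems(items, limit) {
  const sorted = [...items].sort(byDateDescThenTitle);
  return { total: sorted.length, items: sorted.slice(0, limit) };
}

// Default view: 5 grants and 10 works; `full` returns everything.
function defaultView(doc, { full = false } = {}) {
  return {
    grants: topItems(doc.grants || [], full ? Infinity : 5),
    works: topItems(doc.works || [], full ? Infinity : 10),
  };
}

const sampleWorks = [
  { title: 'B paper', date: '2021-05-01' },
  { title: 'A paper', date: '2021-05-01' },
  { title: 'C paper', date: '2023-01-15' },
];
console.log(topItems(sampleWorks, 2).items.map((w) => w.title)); // [ 'C paper', 'A paper' ]
```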
There's potential for adding favorites and order filters to the API, but we'll discuss that in https://github.com/ucd-library/aggie-experts/discussions/524.
Create new routes for getting paginated results of grants and works. For example:
- /api/expert/<id>/works?page=2&size=25
- /api/expert/<id>/grants?page=2&size=25
- page would default to 1 and size (number of results) to 25
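A rough sketch of how the `page` and `size` parameters might be handled, with the defaults given above. The function names and the response shape (`total`, `results`) are assumptions for illustration, not the existing API.

```javascript
// Parse page/size from a query object, falling back to page=1 and size=25
// when a value is missing or not a positive integer.
function parsePaging(query) {
  const page = Number.parseInt(query.page, 10);
  const size = Number.parseInt(query.size, 10);
  return {
    page: Number.isInteger(page) && page > 0 ? page : 1,
    size: Number.isInteger(size) && size > 0 ? size : 25,
  };
}

// Slice one page out of an already-sorted list.
function paginate(items, { page, size }) {
  const start = (page - 1) * size;
  return { page, size, total: items.length, results: items.slice(start, start + size) };
}

// Example: /api/expert/<id>/works?page=2&size=25 against 60 works.
const query = Object.fromEntries(new URLSearchParams('page=2&size=25'));
const allWorks = Array.from({ length: 60 }, (_, i) => ({ id: i + 1 }));
const pageTwo = paginate(allWorks, parsePaging(query));
console.log(pageTwo.results.length, pageTwo.results[0].id); // 25 26
```

Including `total` in each response would let the client render page controls without a separate count request.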
Should we change how the no-sanitize flag works? It's undecided, but perhaps we should let the API code handle returning sanitized data when the user isn't looking at their own profile (or impersonating that profile) and isn't an admin. Should that logic be removed from the client?
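If the decision did move to the server, the check might look something like the sketch below. The user shape (`roles`, `expertId`, `impersonating`) and the function name are hypothetical; the real session object may look quite different.

```javascript
// Hypothetical server-side rule: return unsanitized data only to admins,
// to the expert themselves, or to someone impersonating that expert.
function shouldSanitize(user, expertId) {
  if (!user) return true; // anonymous requests always get sanitized data
  if ((user.roles || []).includes('admin')) return false;
  if (user.expertId === expertId) return false;
  if (user.impersonating === expertId) return false;
  return true;
}

console.log(shouldSanitize(null, 'abc'));                                 // true
console.log(shouldSanitize({ roles: ['admin'] }, 'abc'));                 // false
console.log(shouldSanitize({ roles: [], expertId: 'abc' }, 'abc'));       // false
console.log(shouldSanitize({ roles: [], expertId: 'xyz' }, 'abc'));       // true
```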

qjhart commented 1 month ago

@UcDust this looks good.

[ ] For the works/grants endpoints, do we have a method to retrieve all the results (or just size=100000)?
[ ] How do we want to see the total counts for grants and citations on the expert page?

qjhart commented 1 month ago

@UcDust, I think maybe the easiest thing to do is refactor sanitize into something like subselect that accepts:

expert.subselect(doc, {
  sanitize: true,
  expert: true, // or false
  grants: { page: 1, size: 25 },
  works: null
});

This doesn't really match the API calls, but it does allow us to get the default page with:

expert.subselect(doc, {
  sanitize: true,
  expert: true, // or false
  grants: { page: 1, size: 25 },
  works: { page: 1, size: 25 }
});

I do see a problem in that the counts will be affected by the sanitization step, so your cache would have to include both. This is one reason not to have the server guess, I suppose.
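To make the idea concrete, here is one way a `subselect` with those options could behave. This is only a sketch: the flat document shape (`doc.expert`, `doc.grants`, `doc.works`), the `visible` flag on each item, and the reading of `null` as "omit this component" are assumptions, not the real expert model.

```javascript
// Sketch of subselect(doc, options): optionally drop hidden items (sanitize),
// then return only the requested components and pages.
function subselect(doc, options = {}) {
  const { sanitize = true, expert = true, grants = null, works = null } = options;

  // With sanitize on, items explicitly marked visible: false are removed
  // before paging, so page contents differ between the two modes.
  const visibleOnly = (items) =>
    sanitize ? items.filter((item) => item.visible !== false) : items;

  const pageOf = (items, paging) => {
    if (!paging) return undefined; // null/undefined means "leave this out"
    const { page = 1, size = 25 } = paging;
    return visibleOnly(items || []).slice((page - 1) * size, page * size);
  };

  const out = {};
  if (expert) out.expert = doc.expert;
  const grantPage = pageOf(doc.grants, grants);
  if (grantPage) out.grants = grantPage;
  const workPage = pageOf(doc.works, works);
  if (workPage) out.works = workPage;
  return out;
}

const sampleDoc = {
  expert: { name: 'Example Expert' },
  grants: [{ id: 1 }, { id: 2, visible: false }, { id: 3 }],
  works: [{ id: 'w1' }],
};
const paging = { page: 1, size: 2 };
console.log(subselect(sampleDoc, { sanitize: true, grants: paging }).grants.map((g) => g.id));  // [ 1, 3 ]
console.log(subselect(sampleDoc, { sanitize: false, grants: paging }).grants.map((g) => g.id)); // [ 1, 2 ]
```

The two calls return different first pages from the same document, which illustrates why a cache would need to be keyed on the sanitize setting as well as the page.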

UcDust commented 1 month ago

> @UcDust this looks good.
>
> [ ] For the works/grants endpoints, do we have a method to retrieve all the results (or just size=100000)?
>
> [ ] How do we want to see the total counts for grants and citations on the expert page?

@qjhart To retrieve all results, what if we added another param for ?full or ?all that would return all grants/works for that expert? Just using a huge size could work too. For total counts, could we have a structure like:

hits: {
    works: {
        total: 27,
        visible: 24
    },
    grants: {
        total: 7,
        visible: 4
    }
}

(not sure on the hits verbiage, but something along those lines maybe?)
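A small sketch of how such a structure could be computed before any sanitization or paging is applied. The `visible` flag and the flat `doc.works` / `doc.grants` arrays are assumed names for illustration.

```javascript
// Count all items and the subset not explicitly marked as hidden.
function countHits(items) {
  const total = items.length;
  const visible = items.filter((item) => item.visible !== false).length;
  return { total, visible };
}

// Build the proposed `hits` structure from the full, unsanitized document.
function buildHits(doc) {
  return {
    works: countHits(doc.works || []),
    grants: countHits(doc.grants || []),
  };
}

const hitsDoc = {
  works: [{ id: 1 }, { id: 2, visible: false }, { id: 3 }],
  grants: [{ id: 'g1', visible: false }],
};
console.log(JSON.stringify(buildHits(hitsDoc)));
// {"works":{"total":3,"visible":2},"grants":{"total":1,"visible":0}}
```

Computing both numbers from the unsanitized document means the same `hits` object can be returned whether or not the response itself is sanitized.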

UcDust commented 4 weeks ago

@qjhart I created the https://github.com/ucd-library/aggie-experts/compare/dc-api-subselect branch with a start to the sanitize logic changes.

We'll need to optimize more once we analyze the type of sorting we can do on grants/works, and the client needs to be wired in still.

Also, admin mode (and a user viewing their own profile) still sends the ?no-sanitize flag, which bypasses this logic. So we'll need to think of an approach there, perhaps removing the flag.

ucd-library / aggie-experts

Expert's documents can get ridiculously large #487