ome / omero-web

Django-based OMERO.web client
https://www.openmicroscopy.org/omero
16 stars 29 forks source link

Performance with large numbers of FileAnnotations #517

Closed chris-allan closed 4 months ago

chris-allan commented 7 months ago

On a busy long running system, particularly one where image data is being run through one or more analytical pipelines, the number of FileAnnotations in a given group can easily number in the 1000s. We have exposure to some systems where it numbers in the 10s of thousands. The current "File Attachment" dialog assumes that it can provide a list of every FileAnnotation not currently linked to the selected object(s) for linking; performing quite poorly at 10^3, terribly at 10^4 and causes server errors at 10^5.

Not sure if this is also be a problem for OMERO.insight, @jburel + @dominikl.

This SQL query gives a FileAnnotation vs. group summary to get a sense of the scale on a particular install:

SELECT count(*), experimentergroup.id, experimentergroup.name, experimentergroup.description FROM annotation
    JOIN experimentergroup ON annotation.group_id = experimentergroup.id
        WHERE discriminator = '/type/OriginalFile/' 
            GROUP BY experimentergroup.id;

/cc @emilroz, @muhanadz, @erindiel

chris-allan commented 7 months ago

Suggesting that for first pass mitigation @knabar and I look at possibly limiting the number of objects shown. This will require a little bit of refactoring due to the way the queries are performed and client side lexical sorting by annotation.file.name.

will-moore commented 7 months ago

This has also been reported at #499 via Image.sc

chris-allan commented 7 months ago

Thanks, @will-moore.

I've done a little bit of homework to hopefully help us make some good decisions here. Multiple parent support was added in ome/openmicroscopy#570 which was a rebase of ome/openmicroscopy#557. The initial implementation for "orphan" annotation lookup was added in ome/openmicroscopy@e4534d37711d3d8c24fa09666781d859e750ce69 as part of the refactoring detailed in https://trac.openmicroscopy.org/ome/ticket/3650. The naming is unfortunate as what are looked up are not really orphans but rather annotations that are not linked to the parent object(s). Furthermore, there are various hidden features, such as how objects with an unset namespace are handled, that remain undocumented. It's clear from limited investigation that much of our code relies on these undocumented features.

The call stack is loosely:

omeroweb.webclient.views.annotate_file()

to

omeroweb.webclient.controller.container.getFilesByObject()

to

omero.gateway.BlitzGateway.listOrphanedAnnotations()

As briefly raised by @joshmoore on #499, if we want to handle large numbers of FileAnnotations in a better way, predictable counting and paging are baseline requirements. Achieving this with the current API would be a significant refactor across both omero-py and omero-web and require updating the omero-py>=5.7.0 dependency (over 3 years old) in omero-web.

In order to tackle performance and safety, there are three main problem areas:

  1. Client side sorting performed by omeroweb.webclient.controller.container.getFilesByObject()#sort_file_anns() (an inner function)
  2. Lack of support for paging in omero.gateway.BlitzGateway.listOrphanedAnnotations()
  3. Potential IN clause explosion with multiple parents due to multiple queries being performed in succession

I've been investigating queries to collapse down a bunch of these issues so that we can safely reproduce the sorting and multiple parent features while being able to add counting and paging. After a relational algebra 101 refresher I've come up with the following:

SELECT fa.id FROM FileAnnotation AS fa
    WHERE NOT EXISTS (
        SELECT 1 FROM ScreenAnnotationLink l
            WHERE l.parent.id IN (51, 52)
                AND fa.id = l.child.id
            GROUP BY l.child.id
            HAVING count(l.id) >= 2
    )

That's for Screen parents [51, 52] which has a length 2 for the HAVING clause. The above is not strictly relational division but is inspired by it.

We should be able to make a version of the above generic to all the annotation and parent types as well as add counting, order (sorting), as well as offsets and limits (paging). The requirements for the UI are strictly FileAnnotation.id and FileAnnotation.OriginalFile.name at present.

/cc @kkoz, @sbesson

For reference:

chris-allan commented 7 months ago

Compatible example query which performs sorting and extracts the required information:

SELECT fa.id, ofile.name FROM FileAnnotation AS fa
    JOIN fa.file AS ofile
        WHERE NOT EXISTS (
            SELECT 1 FROM ScreenAnnotationLink sa_link
                WHERE sa_link.parent.id in (51, 52)
                    AND fa.id = sa_link.child.id
                    AND sa_link.details.owner.id = 0
                GROUP BY sa_link.child.id
                HAVING count(sa_link.id) >= 2
        )
        ORDER BY lower(ofile.name)

Edit 1 (2023-12-07): I forgot to add the ownership filter from addedByMe. The addedByMe feature (defaults to True) of listOrphanedAnnotations() was added in ome/openmicroscopy@af1998b16992a9e0b40366824da5e01f1d1ca093 as part of ome/openmicroscopy#460. It ensures that FileAnnotations linked by another user are also considered "orphans" and is currently undocumented. From reading the PR it appears this was originally designed to support adding and removing tags. Usage of this method for tags has long been superseded by the tag dialog added in ome/openmicroscopy#2336.

will-moore commented 7 months ago

I'm just wondering what the UI will look like after these changes / fixes... I have a feeling that the majority of users using the "Add File Annotation" dialog are actually looking to create a new File Annotation from a local file upload and not pick from existing annotations. So it may be worth having a separate dialog for "Link existing FileAnnotation" or at least loading those on request (e.g. if a user clicks a Link existing FileAnnotation button) instead of by default. If the user has e.g. 10,000 File Annotations, it's not ideal to click through a page of e.g 1000 at a time to find the one they want to link (search/filter might be nicer). If you've got those numbers of FileAnnotations, you presumably created them via the API and hopefully linked them correctly to their parents at the time. So is the main use-case of this dialog to re-link annotations that were accidentally unlinked? I guess if this UI is only occasionally used for that, then maybe it doesn't have to be super optimised.

If we're going to offer pagination, do we also need a query to get the total number of "orphaned" FileAnnotations, so we know how many pages? Or we could just offer a "Next" button until we get to the end (incomplete page). Again, that would be sub-optimal but OK for occasional use.

If users are mainly using this to fix accidentally unlinked FileAnnotations, is there some improvement we could make to the "unlink" workflow to help improve the clarity? Current UI doesn't say much:

Screenshot 2023-12-07 at 10 20 46

Final thought - Is there any value is offering the option of choosing genuine orphaned FileAnnotations (not linked to any parent) as well as FileAnnotations not linked to the selected objects? I imagine that mostly you're not interested in linking FileAnnotations that are already linked to other objects? I guess the UI design that informs users of the difference is key if we want to offer that.

chris-allan commented 7 months ago

My personal goals for initial work are solely to make this dialog safe to open. Assuming we can get a query to mimic the current behaviour, what I have above is about 95% there, the changes would then be:

  1. Decide on a reasonable maximum number of "orphans" to show (100 is the Django forms recommended maximum)
  2. Decide on a default sort order for the query (I would probably go with reverse chronological of the file annotation update event)
  3. Implement the query based on 1 and 2 in getFilesByObject() and refactor accordingly
  4. Deprecate listOrphanedAnnotations() (in our codebases it is only used for this workflow, is dangerous, and poorly documented)
  5. Mimic the expected inputs to Django forms in the annotate_file() view
  6. Provide a count of the "orphans" to an expanded FilesAnnotationForm Django forms implementation
  7. Update the files_form.html template to show the count if greater than the maximum and tell the user we're only showing the maximum based on the criteria set in 1 and 2

If more UI rework is desired, it can come after we make this safe IMO.


I have a feeling that the majority of users using the "Add File Annotation" dialog are actually looking to create a new File Annotation from a local file upload and not pick from existing annotations.

Agreed wholeheartedly. I expect a great many users have never even considered the fact that you can crosslink file annotations and they completely ignore the bottom part of the current dialog.

So it may be worth having a separate dialog for "Link existing FileAnnotation" or at least loading those on request (e.g. if a user clicks a Link existing FileAnnotation button) instead of by default.

Certainly an option.

If the user has e.g. 10,000 File Annotations, it's not ideal to click through a page of e.g 1000 at a time to find the one they want to link (search/filter might be nicer).

Agreed. Nor does the current UI, with a display of filename only, give the user enough context to make that decision in a large number of cases. Search or filter would definitely be nicer. The current UI utilizes Django forms and the ModelChoiceField field type which recommends its use for no more than 100 items:

If you've got those numbers of FileAnnotations, you presumably created them via the API and hopefully linked them correctly to their parents at the time.

That is certainly the case for our users.

So is the main use-case of this dialog to re-link annotations that were accidentally unlinked?

It's been over 12 years since we created the dialog and it basically hasn't changed since. I'm quite confident the original intent did not consider programmatic association of hundreds of thousands of file annotations and shared groups of dozens of users (our use case). As for exactly what the "main use-case" is, at least to me it's for linking existing known file annotations with another object not re-linking annotations that were accidentally unlinked. But that's wild speculation, I'm not aware of anyone actually using this feature and if they are we have no real concrete way to assess how or why they are using it.

The vast majority of our deployments have a small number of shared read-annotate or read-write groups. In such circumstances this dialog is nearly entirely useless as it becomes quickly polluted with the file annotations of everyone in the group.

If we're going to offer pagination, do we also need a query to get the total number of "orphaned" FileAnnotations, so we know how many pages?

Yes, but that's pretty straightforward to do if we have a version of the query above ready.

If users are mainly using this to fix accidentally unlinked FileAnnotations, is there some improvement we could make to the "unlink" workflow to help improve the clarity?

Of course. Unlink vs. delete is already an unnatural and complicated topic across the entirety of OMERO.

Is there any value is offering the option of choosing genuine orphaned FileAnnotations (not linked to any parent) as well as FileAnnotations not linked to the selected objects?

I'm sure there is but I expect that quickly takes us into a discussion not just about linking file annotations but rather file annotation management as a whole which we have essentially no graphical UI for in either OMERO.web or OMERO.insight.

I imagine that mostly you're not interested in linking FileAnnotations that are already linked to other objects?

I would say no, they're not but again that is speculation. I don't think any of us have the data to support such an assertion and given what I expect we would all agree on, almost no one uses this feature, that data would be pretty weak anyway.

will-moore commented 7 months ago

Agree with all those points and plan, with 1 exception: I think you could make the cutoff limit greater than 100, and possibly stick with sorting by name too. I understand the Django advice of "don't use it for more than 100 items" but we know that you can use it for more than that and it is certainly better to have a long list with sub-optimal usability than to have a list that is truncated and doesn't have the items you want. It would need testing with different numbers, but if you were to test with e.g. 250 or 500 items, then I expect the performance hit of loading these wouldn't be too high, the select would still be usable and you would be less likely to break existing user's workflows?

chris-allan commented 7 months ago

I'm honestly not that opinionated on the maximum we just need to decide on one. The query is going to perform significantly better than it does now because we will not be loading the entire FileAnnotation <--> OriginalFile object graph, which also currently contains Experimenter instances, for the two fields the UI requires. While I think it's already pretty unusable at 100 items, I expect we could handle 1000 items quite easily.

I think showing the most recently added file annotations at the top of the list is far more natural but as I expect we've already established, the use of this feature has got to be low and it's already unusable for so many use cases keeping it lexical is also fine. It is not going to be difficult to change if we decide on something different later on.

chris-allan commented 7 months ago

A first pass for discussion during tomorrow's web meeting:

will-moore commented 7 months ago

Thanks Chris. Do you want to open a (draft) PR for that branch, so we can discuss there?

I just wrote a little script for generating lots of FileAnnotations if it's useful for testing anywhere. Created a bunch of them on merge-ci for when that PR is open, although I think draft PRs aren't merged by default.

will-moore commented 7 months ago

Sorry - script is at https://github.com/will-moore/python-scripts/blob/main/create_file_annotations.py

jburel commented 6 months ago

insight uses directly the method from the MetadataService i.e. https://github.com/ome/omero-server/blob/master/src/main/java/ome/logic/MetadataImpl.java#L575

This is the call used for example when expanding the attachment tab

jburel commented 6 months ago

Insight has a 2 steps process to add attachments

as mentioned earlier, insight does not use use the Java gateway (which has a loadAnnotations) method so that will need to be reviewed since we could have 2 entry points that could create trouble insight and java-gateway. First step should be to modified insight to use the java gateway and then review the java-gateway implementation I will create tickets cc @dominikl

chris-allan commented 6 months ago

insight uses directly the method from the MetadataService i.e. https://github.com/ome/omero-server/blob/master/src/main/java/ome/logic/MetadataImpl.java#L575

Understood. I don't think the server side MetadataService currently offers the functionality for loading already uploaded FileAnnotations which may be able to be linked to the current object hierarchy discussed in this issue.

jburel commented 6 months ago

That the method we use to load the annotations owned by a user (attachments tabs for example) If a user has plenty of annotations this could become problematic

chris-allan commented 6 months ago

That the method we use to load the annotations owned by a user (attachments tabs for example) If a user has plenty of annotations this could become problematic

Understood. We definitely have cases where one user (a service account) would have many, many FileAnnotations.

chris-allan commented 4 months ago

@knabar / @will-moore: I expect with #519 having been merged and released in 5.24.0 we can close this issue?

chris-allan commented 4 months ago

Also, we should post on https://forum.image.sc/t/unable-to-add-attachment-using-omero-web/84146/18 as well, no?

will-moore commented 4 months ago

Thanks Chris. Posted on image.sc and closing this...