m2ms / fragalysis-frontend

The React, Redux frontend built by webpack

Bulk tag assignment outside of frontend #1263

Open phraenquex opened 10 months ago

phraenquex commented 10 months ago

Need a table/spreadsheet modal to allow rigorous tag assignment.

Include "create tag" widget from Tag details panel.

mwinokan commented 8 months ago

@phraenquex is this still needed? As we discussed today the bulk tag assignment options are not terrible...

mwinokan commented 6 months ago

@phraenquex and @mwinokan discussed that data curators need a way to edit tags in bulk outside of Fragalysis, whether in Excel or programmatically. Can @kaliif's mechanisms for exporting and importing tags have a frontend UI so that data curators can download a CSV of all their tags and re-upload them?

kaliif commented 5 months ago

@mwinokan I need an example tag upload file.

mwinokan commented 5 months ago

@kaliif what is the current format of the tag export you implemented? I will think about a spec, but it would help if you could briefly comment on how difficult it is to parse a CSV and modify entries for the following:

kaliif commented 5 months ago

@mwinokan example attached. Basically, generated tags are in their own columns, curated tags (none in this file) in separate columns, with a boolean flag in the cell to indicate presence. The column '[Other] New' is a category that was added later; that might be a bug.

Parsing a CSV is not difficult, but obviously it needs a clear and unambiguous schema. For importing in the backend I could very well work with key-value pairs, but understandably that is not very human-friendly.

Not sure what you mean by modifying entries; do you need me to change something in the backend?

metadata.csv

mwinokan commented 5 months ago

Thanks @kaliif, could we support the following pipeline:

  1. User downloads the metadata.csv
  2. User modifies the metadata.csv in Excel or otherwise
  3. User uploads the modified metadata.csv
  4. Backend parses the uploaded file and updates the DB with supported changes
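For illustration, here is a minimal sketch of how step 4 might read the curated-tag cells back out of the uploaded file; the "Code" column name, the set of tag columns, and the accepted truthy spellings are all assumptions, not the actual Fragalysis schema:

```python
import csv
from typing import Iterator, Tuple

TRUTHY = {"1", "true", "yes", "x"}  # assumed spellings a curator might put in a tag cell


def iter_curated_tag_flags(
    csv_path: str, tag_columns: list[str]
) -> Iterator[Tuple[str, str, bool]]:
    """Yield (observation_code, tag_name, present) for every curated-tag cell.

    `tag_columns` lists the columns to treat as curated tags; "Code" is assumed
    to be the observation short-code column.
    """
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            code = (row.get("Code") or "").strip()
            for column in tag_columns:
                present = (row.get(column) or "").strip().lower() in TRUTHY
                yield code, column, present
```

The backend would then look up each observation code, apply the tag changes it supports, and report anything it cannot match as a validation error rather than failing silently.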

kaliif commented 5 months ago

As long as an observation can belong to only one pose and a pose has no tags (or just one tag), this table structure will work just fine.

Compound aliases will be difficult to fit in: there's already a one-to-many relationship in the file between site observation and compound, and this may be confusing for users to edit. And if there's any metadata to be associated with aliases, the input structure would be even more complex. I suspect this needs to be a separate file. I can create a stub in the download, maybe something like:

| compound_id | alias type 1 | ... | alias type n |
|-------------|--------------|-----|--------------|
| 1           |              |     |              |

Would that work?

mwinokan commented 5 months ago

@kaliif yes I think that works great, thanks

kaliif commented 5 months ago

@mwinokan do you have the list of alias types you'd like to use? Ideally I'd map them to database columns, so they'd need to be known strings. I could of course link them via an m2m table and have the database generate types on the fly, but I've seen many times how quickly this gets unmanageably messy, so I'd rather use the columns.

mwinokan commented 5 months ago

@mwinokan to talk to @phraenquex and see if on-the-fly alias type creation is necessary

mwinokan commented 5 months ago

@kaliif I'm checking with data curators on Slack (trying to avoid on-the-fly alias support)

phraenquex commented 5 months ago

It'll certainly be necessary for alias types to be added in the middle of a project.

But on reflection, that shouldn't be done magically during upload: you'd want them added actively, and the upload rejected (with an informative message!) if it has unregistered types. (Else you quickly accumulate typos that create infuriating UI behaviours.)

So the main question to resolve is the mechanism for adding types; a simple front-end modal, accessible from the hamburger menu, might do the job.

mwinokan commented 5 months ago

@phraenquex from the Slack discussion it seems like we may be fine with just two alias types. @kaliif said in the meeting that it would be much simpler not to support on-the-fly alias type creation.

phraenquex commented 5 months ago

@mwinokan I'm not sure the Slack discussion pinned down the use-case (from what I can tell).

Copying from the Slack:

with "type" I had in mind things like [ASAP, CDD, Enamine, ZINC, READDi, SGC, CMD, etc] With each compound possibly having an alias for each type. So definitely dynamic and specific per project. (If this is already covered in the spec, then ignore. But I could not work out what we would practically do with catalogue and project.)

mwinokan commented 5 months ago

@phraenquex I think it would be overkill (and not the best use of our budget) to support any number of arbitrary alias types in the backend and frontend. I think having two aliases is sufficient for all the projects I have been involved in:

Having these two also makes the hierarchy clear: display the project alias if one is available, else display the catalogue alias.

CDD uploads, for example, use the ASAP ID anyway; I don't think it's best practice for us to promote the use of so many different monikers for the same compound.

phraenquex commented 5 months ago

No, in general, compounds should be able to have arbitrary many aliases. (Even if in practice it might never be more than 2 or 3.)

And in the F/E, you need to be able to change which aliases ("alias type") are being shown.

Hence my spec.

If that creates problems, please spell out what they are - and how it explodes the spec.

mwinokan commented 5 months ago

@phraenquex for the arbitrary aliases we would need:

This is on top of the work to:

Your call

mwinokan commented 5 months ago

@phraenquex says: the same compound may have different names in different contexts, e.g. for ASAP there are Enamine/Postera/CDD IDs for each compound, and these must be captured in Fragalysis.

However, the alias class/type should act as a validation step: on upload there must be a way to specify the types of aliases relevant to the target/project, against which future CSV uploads will be validated. This will require UI at upload time, and once the data is live in the f/e as well.

This impacts searching for compounds as well; we will likely need to revisit how observations can be searched via their various aliases (and, in future, other properties).

@tdudgeon wonders whether the aliases will need to be intelligently applied across targets where compounds are reused. @phraenquex says that the aliases should be scoped to the target only.

mwinokan commented 5 months ago

@kaliif please implement the spec discussed earlier in this ticket regarding the bulk tag assignment. We will continue the discussion regarding the aliases in #1412

kaliif commented 5 months ago

@mwinokan can you give me an example file that contains the data for the simplified spec? Also, how would the upload work: does it require a new endpoint, or will it use one of the existing mechanisms?

Waztom commented 5 months ago

@Waztom will review the CSV and get data curators' feedback re:

  1. Re-uploading data
  2. Overwriting existing tags
  3. Creating new tags

phraenquex commented 5 months ago

@Waztom @kaliif I can answer all those:

  1. Tags should be preserved when re-uploading data
  2. Existing tags should not be overwritten. (Curators must delete them in the frontend, and that's deliberate.)
  3. New tags should be created as required. For this purpose, the tag CSV should have columns for each of the tag attributes, but allow the columns to be empty. (Colour as a #RGB string; I forget what the others are.)

tdudgeon commented 5 months ago

My expectation would be that a user would create a "tag file" and continually update it and then want to reload that data. Therefore it would seem necessary to have an option to wipe existing tags and reload the updated file.

kaliif commented 5 months ago

I've been working with the file metadata.csv from the downloads as the base file. On the first download this is populated with the existing tags: generated tags are in per-category columns with the tag name in the cell, and curated tags each have their own column with a boolean value in the cell marking presence for the site observation. This template can easily be expanded with new tags.

metadata.csv
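To make that layout concrete, here is a toy example of how one row could be interpreted; all column and tag names below are invented for illustration, not taken from a real metadata.csv:

```python
# One row of the described layout, as csv.DictReader would return it (names made up).
row = {
    "Code": "obs-001",                   # site observation short code
    "CanonSites": "1 - His71",           # generated tag: category column, tag name in the cell
    "ConformerSites": "1a - His71 alt",  # generated tag: another category column
    "[Other] Follow-up": "TRUE",         # curated tag: its own column, boolean marks presence
    "[Other] Deprecated": "",            # curated tag not applied to this observation
}

GENERATED_CATEGORIES = {"CanonSites", "ConformerSites"}  # assumed category columns

generated_tags = {c: v for c, v in row.items() if c in GENERATED_CATEGORIES and v}
curated_tags = {
    c for c, v in row.items()
    if c not in GENERATED_CATEGORIES and c != "Code" and v.strip().upper() in ("TRUE", "1")
}

print(generated_tags)  # {'CanonSites': '1 - His71', 'ConformerSites': '1a - His71 alt'}
print(curated_tags)    # {'[Other] Follow-up'}
```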

phraenquex commented 5 months ago

@tdudgeon @kaliif It's critical that tag creation and editing in the frontend are fully respected. So yes, a data uploader might have their tag file and upload it every time, but that can't step on things done in the frontend.

@Waztom, you and I should pin down a workflow spec today, so that @kaliif and @matej-vavrek can stop having to guess.

(I'm pretty sure curators won't have strong views on the precise workflow, they just need it to be completely predictable, and well documented.)

phraenquex commented 5 months ago

@kaliif is it a bug or feature that some tag columns in metadata.csv are TRUE/FALSE, and others 0/1?

kaliif commented 5 months ago

@phraenquex feature. I assumed users fill this out by hand and use different ways to mark the tags.
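If both conventions are to be accepted on re-upload, a small normaliser along these lines would cover it (the accepted spellings are a guess at what curators might type, not a documented list):

```python
def parse_tag_flag(cell: str) -> bool:
    """Interpret a curated-tag cell, accepting TRUE/FALSE, 1/0, yes/no and blank."""
    value = (cell or "").strip().lower()
    if value in {"true", "1", "yes", "y", "x"}:
        return True
    if value in {"false", "0", "no", "n", ""}:
        return False
    raise ValueError(f"Unrecognised tag flag: {cell!r}")
```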

Waztom commented 5 months ago

@kaliif and @tdudgeon spoke to Frank yesterday and came up with the following spec:

  1. The Tag file will update/overwrite all tags - it is the responsibility of the data curator to edit tags with the understanding that whatever they upload will overwrite existing tags.
  2. The Tag file will be the metadata.csv with three additional columns ("Curated tag", "Tag edited user", "Tag edited date")
  3. Backend tracks when Tags are changed and the User responsible, for both bulk uploads and frontend tag changes. @phraenquex we spoke about updating a file with this info every time a Tag is changed. The alternative is to capture the Tag before/after state only. Editing one tag at a time on the frontend will generate too many files? The data curator is also likely most interested in the most recent edits?
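Purely as an illustration of point 3, an audit table along these lines could record who changed which tag and when, for both upload and frontend edits (the model and field names are invented for this sketch, not existing Fragalysis models):

```python
from django.db import models


class TagEditLog(models.Model):
    """Hypothetical audit record for a single tag change, whether from a bulk
    CSV upload or from an edit made in the frontend."""

    target_name = models.CharField(max_length=200)
    observation_code = models.CharField(max_length=200)
    tag_name = models.CharField(max_length=200)
    action = models.CharField(max_length=20)       # e.g. "added", "removed", "renamed"
    edited_by = models.CharField(max_length=200)   # FedID of the responsible user
    edited_at = models.DateTimeField(auto_now_add=True)
    source = models.CharField(max_length=20)       # "upload" or "frontend"
```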

kaliif commented 5 months ago

> @kaliif and @tdudgeon spoke to Frank yesterday and came up with the following spec:
>
> 1. The Tag file will update/overwrite all tags - it is the responsibility of the data curator to edit tags with the understanding that whatever they upload will overwrite existing tags.

What is meant by update and overwrite?

> 2. The Tag file will be the metadata.csv with three additional columns ("Curated tag", "Tag edited user", "Tag edited date")

Need an example file.

> 3. Backend tracks when Tags are changed and the User responsible, for both bulk uploads and frontend tag changes. @phraenquex we spoke about updating a file with this info every time a Tag is changed. The alternative is to capture the Tag before/after state only. Editing one tag at a time on the frontend will generate too many files? The data curator is also likely most interested in the most recent edits?

Same as 1., how do you imagine changing the tags?

mwinokan commented 3 months ago

@kaliif sorry for the behemoth spec, please let me know if you have any questions or concerns!

@Waztom and I discussed what is now hopefully the final spec for this ticket.

The general workflow that should be supported for any deployed target:

  1. User downloads the data as usual
  2. User modifies the metadata.csv
  3. User uploads the metadata.csv to a new endpoint

1. Changes to download CSV

The metadata.csv should remain largely unchanged, except for the order of the columns. This is to simplify communicating the validation logic on which columns are editable by the user. The order should be:

2. Modification of the metadata.csv

With this new layout of the metadata.csv the columns will be split into three logical groups:

3. New metadata update API endpoint

A new upload endpoint and minimal Django UI will be needed to allow the user to upload a modified CSV:

Some suggested validation rules (which should have sensible error messages):

Example data

Original metadata.csv from A71EV2A staging

metadata_orig.csv

Reordered metadata.csv with added Pose column (from new spec above)

metadata_reordered.csv

Example modified metadata.csv (which does not break any of the rules)

metadata_modified_ok.csv

Example modified metadata.csv (does not break any rules, saved as Excel file)

This one might be useful because I have filled the cells in grey to indicate ones that are not allowed to be modified by the user, and in green those that I have modified.

metadata_modified_ok.xlsx

Example modified metadata.csv (where I have sprinkled in bad data and saved as Excel file)

Again, this one has useful cell formatting to indicate the bad data in red.

metadata_modified_bad.xlsx

kaliif commented 3 months ago

@mwinokan @Waztom I've read it now and (not having started working on this yet) I'd say I understand all the points here. And it seems to me (again, not having started actual work yet) that there's nothing terribly complicated; it's a fair amount of work, but I'd say we're talking about days here, not weeks.

But I'd like to point out that you're asking the users to fill out a big and complex spreadsheet by hand, and in my experience this is going to be a pain. People are just not very good at this. Yes, I can run validation on the file and send the user feedback on what's wrong, but, for example, [Othrr] is not a known category, so I'll have to throw an error there. And only then, when the user fixes this to [Other], can I move on to all the creative ways they've filled out the cells. So there are likely going to be several attempts to upload, with frustration rising each round.

I don't have a problem taking this task on, but I think you should discuss with the front-end guys whether to solve it there. I've done enough front-end work to know it's going to be a pain to implement, but with this level of user input needed, it seems to me that is the correct place to do this.

mwinokan commented 3 months ago

Thanks @kaliif, glad to hear that the b/e work seems achievable on a timescale of days.

Regarding whether this should be a f/e feature instead, we did think about this, but as you say it is a significant amount of work to essentially rebuild spreadsheet software in Fragalysis. This CSV I/O feature is aimed at power users who are already quite meticulous with their spreadsheets (e.g. Jasmin & Ryan), and the functionality will be very useful for computational folk (like myself) who will add tags algorithmically. The default advice for users will be to use the frontend and carefully modify individual observations.

So please go ahead with the spec

kaliif commented 3 months ago

@mwinokan I've been working on this ticket for a couple of days now and some questions have arisen:

mwinokan commented 3 months ago

Thanks @kaliif

> You mentioned several times in the spec that the user should not be allowed to change the first 12 columns.

I think it's fine to ignore those columns for now, as long as the short/long observation code can still be used to match to the relevant DB entries.

> The aliases for a given XCA tag upload name should be consistent across the file

By this I meant that if a user changes the alias for an XCA-generated tag, e.g. 3 - His71 to 3 - Allosteric, then that should be consistent across the whole document: 3 - His71 should not appear anywhere else, and nor should another alias for Canonical Site 3. Please also check for the correct syntax: X - YYYYY

> Each value of the pose column should match an observation shortcode in the data

In addition to validating that any values in the uploaded Pose column reference an existing observation, please also update the database to reflect any changes to poses, such as creation, deletion, and reassignment.

> Track timestamp and FedID of any modifications in the database

With regards to deleting tags, I am not sure what the precise desire for the spec is (@phraenquex please comment if you have a moment). I would suggest logging deletions with (FedID, Timestamp, Tag Type, Tag Name, Observation), but if this requires significant development we can defer until Frank confirms.
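A rough sketch of the alias-consistency and pose-reference checks described above, operating on already-parsed CSV rows; the `CanonSites`, `Pose` and `Code` column names and the row structure are assumptions for illustration:

```python
import re
from collections import defaultdict

SITE_ALIAS_RE = re.compile(r"^\d+ - \S.*$")  # the "X - YYYYY" syntax described above


def validate_canon_site_aliases(rows: list[dict]) -> list[str]:
    """Each canonical site number must have exactly one alias, in the right syntax."""
    errors = []
    aliases = defaultdict(set)
    for row in rows:
        alias = (row.get("CanonSites") or "").strip()
        if not alias:
            continue
        if not SITE_ALIAS_RE.match(alias):
            errors.append(f"Bad CanonSites value {alias!r} (expected 'X - YYYYY')")
            continue
        site_number = alias.split(" - ", 1)[0]
        aliases[site_number].add(alias)
    for number, names in aliases.items():
        if len(names) > 1:
            errors.append(f"Canonical site {number} has conflicting aliases: {sorted(names)}")
    return errors


def validate_pose_references(rows: list[dict]) -> list[str]:
    """Every non-empty Pose value must match an observation code present in the file."""
    codes = {(row.get("Code") or "").strip() for row in rows}
    return [
        f"Pose {row['Pose']!r} does not match any observation code in the file"
        for row in rows
        if (row.get("Pose") or "").strip() and row["Pose"].strip() not in codes
    ]
```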

kaliif commented 3 months ago

> By this I meant that if a user changes the alias for an XCA-generated tag, e.g. 3 - His71 to 3 - Allosteric, then that should be consistent across the whole document: 3 - His71 should not appear anywhere else, and nor should another alias for Canonical Site 3. Please also check for the correct syntax: X - YYYYY

@mwinokan since the uploaded file is discarded, the concept of 'consistency across the document' is meaningless. What do I need to look for here? Do you mean the tag names (both auto-generated and aliases) should be unique across a target?

> With regards to deleting tags, I am not sure what the precise desire for the spec is (@phraenquex please comment if you have a moment). I would suggest logging deletions with (FedID, Timestamp, Tag Type, Tag Name, Observation), but if this requires significant development we can defer until Frank confirms.

Is it just tags, or do I need to track pose operations as well?

> In addition to validating that any values in the uploaded Pose column reference an existing observation, please also update the database to reflect any changes to poses, such as creation, deletion, and reassignment.

This may require some changes to the file. First, the file doesn't specify the main observation for a pose; this may require additional column(s), or I'll have to assign them randomly. Then, the validation rules (each observation must belong to a pose and each pose must have at least one observation) are difficult to follow in bulk operations. The file has to be exactly correct; if there's a conflict I'll again have to use some randomness (if it is solvable at all), which may not be what the user wants.

mwinokan commented 3 months ago

@kaliif

> consistency across the document

Here I mean that in the CanonSites alias column there cannot be both 3 - His71 and 3 - Allosteric present. Each CanonSite, indicated by its number, should have only one alias.

> tracking pose operations

I would suggest we postpone the tracking of these until we have confirmed a spec with @phraenquex in a new ticket.

> updating database with pose changes

Instead of assigning randomly, can you make the default that the observation the pose is named after becomes the main observation? In the case of a validation conflict, I don't think you should resort to a random resolution; I would rather you throw an error message, and hopefully it is informative enough for the user to change the CSV so it passes validation.
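A minimal sketch of that default rule, assuming each pose is named after one of its member observations (function name, arguments, and error wording are all illustrative):

```python
def choose_main_observation(pose_name: str, member_codes: list[str]) -> str:
    """Pick a pose's main observation: default to the observation the pose is named
    after, and fail with an informative error instead of choosing at random."""
    if pose_name in member_codes:
        return pose_name
    raise ValueError(
        f"Pose {pose_name!r} is not named after any of its member observations "
        f"{member_codes}; please correct the Pose column and re-upload."
    )
```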

mwinokan commented 3 months ago

@phraenquex says that we want to log the tag changes so that a data curator can retrieve a log of changes to tags (& poses) in the download, allowing unwanted changes to be undone.

@kaliif says it's several days of work (due to outstanding upload work and the frequent need for testing) and so should be a separate ticket. @kaliif please investigate the logging in a separate ticket so that we can prioritise.

kaliif commented 3 months ago

> Example data
>
> Original metadata.csv from A71EV2A staging
>
> metadata_orig.csv
>
> Reordered metadata.csv with added Pose column (from new spec above)
>
> metadata_reordered.csv
>
> Example modified metadata.csv (which does not break any of the rules)
>
> metadata_modified_ok.csv
>
> Example modified metadata.csv (does not break any rules, saved as Excel file)
>
> This one might be useful because I have filled the cells in grey to indicate ones that are not allowed to be modified by the user, and in green those that I have modified.
>
> metadata_modified_ok.xlsx
>
> Example modified metadata.csv (where I have sprinkled in bad data and saved as Excel file)
>
> Again, this one has useful cell formatting to indicate the bad data in red.
>
> metadata_modified_bad.xlsx

@mwinokan do you have the corresponding data tarball to go with those example files? I've tried some A71EV2A data that I have and I'm running into conflicts.

mwinokan commented 3 months ago

@kaliif I intended the CSVs to be compatible with the A71EV2A data from June; if that doesn't work I might have to recreate the files.

mwinokan commented 2 months ago

@kaliif will add the extra columns for short tags as part of green, but the importing of modified metadata.csv files will need to wait for new example data and will be part of mint.

mwinokan commented 2 months ago

@kaliif please find here some new example files based on A71EV2A on staging. The validation spec should still be as discussed. Please let me know if you have questions.

I have included the Python script used to generate the modified data, but do note that for the Excel files I have added the cell colours manually.

A71EV2A_1263_examples.zip

mwinokan commented 2 months ago

Tested today with A71EV2A on staging:

Results in a 500 error:

```
2024-09-17T09:14:34+0000 django.request.log_response():224 ERROR # Internal Server Error: /api/metadata_upload/
Traceback (most recent call last):
  File "/.venv/lib/python3.11/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.11/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/django/views.py", line 84, in sentry_wrapped_callback
    return callback(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.11/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.11/site-packages/rest_framework/viewsets.py", line 125, in view
    return self.dispatch(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 509, in dispatch
    response = self.handle_exception(exc)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 469, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
    raise exc
  File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/viewer/views.py", line 2582, in create
    errors = load_tags_from_file(filename=filename, target=target)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/viewer/tags.py", line 335, in load_tags_from_file
    if so.pose.display_name != pose_name:
       ^^^^^^^^^^^^^^^^^^^^
```

@kaliif is investigating

mwinokan commented 2 months ago

@kaliif please try to catch this exception and show a more useful error in the logs/server response

@mwinokan to see which targets are blocking a full staging database wipe which would hopefully resolve this error
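The traceback above suggests `so.pose` can be `None` at that point; a guard along these lines (a sketch only, with everything other than the names visible in the traceback assumed) would turn the 500 into a readable validation error:

```python
# Sketch for viewer/tags.py, load_tags_from_file(): check for a missing pose
# before comparing display names, instead of letting the exception bubble up
# as an Internal Server Error. `errors` is assumed to be the list the function
# already returns to the caller in views.py.
if so.pose is None:
    errors.append(f"Site observation {so} has no pose assigned (expected {pose_name!r})")
elif so.pose.display_name != pose_name:
    ...  # existing reassignment logic
```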

mwinokan commented 2 months ago

@kaliif says that a fix has been added for the A71EV2A error. @mwinokan to test

mwinokan commented 2 months ago

The functionality on staging is working. Thanks @kaliif. I will begin to push this to production along with #1501