Open phraenquex opened 10 months ago
@phraenquex is this still needed? As we discussed today the bulk tag assignment options are not terrible...
@phraenquex and @mwinokan discussed that data curators need a way to edit tags in bulk outside of Fragalysis. Whether in Excel or programmatically. Can @kaliif's mechanisms for exporting and importing tags have a frontend UI so that data curators can download a CSV of all their tags and re-upload them?
@mwinokan I need an example tag upload file.
@kaliif what is the current format of the tag export you implemented? I will think about a spec, but it would help if you could briefly comment on how difficult it is to parse a CSV and modify entries for the following:
@mwinokan example attached. Basically, generated tags in their own column, curated tags (no such in the file) in separate columns, and a boolean flag in the cell to indicate presence. The column '[Other] New' is a category that was added later, that might be a bug..
Parsing a csv is not difficult but obviously, it needs to be clear and unequivocal schema. For importing in the backend I could very well work with key-value pairs but understandably this is not very human-friendly.
Not sure what you mean by modifying entries, do you need me to change something in the backend?
Thanks @kaliif, could we support the following pipeline:
Until an observation can belong to only one pose and pose has no tags (or just one tag) this table structure will work just fine.
Compound aliases will be difficult to fit in, there's already one-to-many in the file between cite observation and compound, this may be confusing for users to edit. And if there's any metadata to be associated with aliases the input structure would be even more complex. I suspect this needs to be a separate file. I can create a stub in the download, maybe something like:
| compound_id | alias type 1 | ... | alias type n |
|-------------+--------------+-----+--------------|
| 1 | | | |
Would that work?
@kaliif yes I think that works great, thanks
@mwinokan do you have the list of alias types you'd like to use? Ideally, I'd map them to database columns and then they'd need to be known strings. I could of course link them via m2m table and have the database generate types on the fly but I've seen many times how quickly this gets unmanageably messy so I'd rather use the columns.
@mwinokan to talk to @phraenquex and see if on-the-fly alias type creation is necessary
@kaliif I'm checking with data curators on Slack (trying to avoid on-the-fly alias support)
It'll certainly be necessary for alias types to be added in the middle of a project.
But on reflection, that shouldn't be done magically during upload, you'd want them added actively, and the upload to be rejected (with informative message!) if it has unregistered types. (Else you quickly accumulate typos that create infuriating UI behaviours.)
So the main question to resolve, is what's the mechanism of adding types - a simple front-end modal might do the job, accessible from the hamburger menu.
@phraenquex from the Slack discussion it seems like we may be fine with just two alias types. @kaliif said in the meeting that it would be much simpler not support on-the-fly alias type creation.
@mwinokan I'm not sure the Slack discussion pinned down the use-case (from what I can tell).
Copying from the Slack:
with "type" I had in mind things like [ASAP, CDD, Enamine, ZINC, READDi, SGC, CMD, etc] With each compound possibly having an alias for each type. So definitely dynamic and specific per project. (If this is already covered in the spec, then ignore. But I could not work out what we would practically do with
catalogue
andproject
.)
@phraenquex I think it would be overkill (and not the best use of our budget) to support any number of arbitrary alias types in the backend and frontend. I think the having two aliases is sufficient for all the projects I have been involved in:
catalogue
captures either the Zinc-ID or supplier ID.project
would capture the ASAP/READDi ID depending on the projectHaving these two also makes the hierarchy clear: display the project
alias is one is available, else display the catalogue
alias.
CDD uploads for example use the ASAP ID anyway, I don't think it's best practice for us to promote the use of so many different monikers of the same compound.
No, in general, compounds should be able to have arbitrary many aliases. (Even if in practice it might never be more than 2 or 3.)
And in the F/E, you need to be able to change which aliases ("alias type") are being shown.
Hence my spec.
If that creates problems, please spell out what they are - and how it explodes the spec.
@phraenquex for the arbitrary aliases we would need:
This is on top of the work to:
Your call
@phraenquex says: the same compound may have different names in different contexts, e.g. for ASAP there are Enamine/Postera/CDD ID's for each compound, that must be captured in Fragalysis.
However the alias class/type should be a validation step, on upload there must be a way to specify the types of aliases relevant to the target/project that future CSV uploads will be validated against. This will require UI at upload, and once the data is live in the f/e as well.
This impacts searching for compounds as well, we will likely need to revisit how observations can be searched via their various aliases (and in future other properties).
@tdudgeon wonders whether the aliases will need to be intelligently applied across targets where compounds are reused. @phraenquex says that the aliases should be kept only to the target.
@kaliif please implement the spec discussed earlier in this ticket regarding the bulk tag assignment. We will continue the discussion regarding the aliases in #1412
@mwinokan can you give me an example file that contains the data for the simplified spec. Also, how would the upload work? Does it require a new endpoint or is it going to be using one of the existing mechanisms?
@Waztom will review csv and get data curators feedback re:
@Waztom @kaliif I can answer all those:
My expectation would be that a user would create a "tag file" and continually update it and then want to reload that data. Therefore it would seem necessary to have an option to wipe existing tags and reload the updated file.
I've been working with the file metadata.csv
from the downloads as the base file. On the first download, this is populated with existing tags - generated tags are in their columns by category with tag name in the cell, curated tags are in their own column and boolean value in the cell marks presence for the site observation. This template can be easily expanded with new tags.
@tdudgeon @kaliif It's critical that tag creation and editing in the frontend is fully respected. So yes, a data uploader might have their tag file and upload that every time; but that can't step on things done in the frontend.
@Waztom, you and I should pin down a workflow spec today, so that @kaliif and @matej-vavrek can stop having to guess.
(I'm pretty sure curators won't have strong views on the precise workflow, they just need it to be completely predictable, and well documented.)
@kaliif is it a bug or feature that some tag columns in metadata.csv
are TRUE/FALSE, and others 0/1?
@phraenquex feature. I assumed users fill this out by hand and use different ways to mark the tags.
@kaliif and @tdudgeon spoke to Frank yesterday and came up with the following spec:
@kaliif and @tdudgeon spoke to Frank yesterday and came up with the following spec:
1. The Tag file will update/overwrite all tags - it is the responsibility of the data curator to edit tags with the understanding that whatever they use will overwrite existing tags.
What is meant by update and overwrite?
2. The Tag file will be the metadata.csv with three additional columns ("Curated tag", "Tag edited user", "Tag edited date")
Need an example file.
3. Backend tracks when Tags are changed and the User responsible for bulk uploading and frontend tag changes. @phraenquex we spoke about updating a file with this info everytime a Tag is changed. Alternative is to capture Tag before/after state only. Editing one tag at a time on the frontend will generate too many files? Data curator also likely most interested in most recent edits?
Same as 1., how do you imagine changing the tags?
@kaliif sorry for the behemoth spec, please let me know if you have any questions or concerns!
@Waztom and I discussed what is now hopefully the final spec for this ticket.
The general workflow that should be supported for any deployed target:
The metadata.csv should remain largely unchanged, except the order of the columns. This is to simplify the communication of the validation logic on which columns are editable by the user. The order should be:
Code, Long code, Experiment code, Compound code, Smiles, Centroid res, Downloaded
as it does now.ConformerSites upload name, CanonSites upload name, CrystalformSites upload name, Quatassemblies upload name, Crystalforms upload name
ConformerSites alias, CanonSites alias, CrystalformSites alias, Quatassemblies alias, Crystalforms alias
[Other] New, [Other] S1' site, [Other] P2 Wave 1, [Other] S2 site, [Other] Covalent, [Other] S3 site, [Other] S4 site, [Other] S5 site, [Other] Fragment, [Other] S1 site
With this new layout of the metadata.csv the columns will be split into three logical groups:
[Type] Name
, and the database should be updated to reflect exactly what's in the edited file (e.g. existing tags newly assigned to observations, removal of curator tags, addition of new curator tags)A new upload endpoint and minimal django UI will be needed to allow the user to upload a modified CSV:
metadata.csv
Some suggested validation rules (which should have sensible error messages):
[Type] Name
and the Type should be one of the supported Forum, Series, Other
False
This one might be useful because I have filled in the cells grey to indicate ones that are not allowed to be modified by the user, and green those that I have modified metadata_modified_ok.xlsx
Again this one has useful cell formatting to indicate the bad data in red metadata_modified_bad.xlsx
@mwinokan @Waztom I've read it now and (not having started working on this yet) I'd say I understand all the points here. And it seems to me (again, not having started actual work yet) that there's nothing terribly complicated; it's a fair amount of work, but I'd say we're talking about days here, not weeks.
But I'd like to point out that you're asking the users to fill out big and complex spreadsheet by hand and in my experience, this is going to be a pain. People are just not very good at this. Yes, I can run a validation on the file and send the user feedback on what's wrong but, for example, [Othrr]
is not a known category, so I'll have to throw an error here. And only then, when the user fixes this to [Other]
can I move on to all the creative ways they've filled out the cells. So there's likely going to be several attempts to upload with frustration rising with each round.
I don't have a problem taking this task on, but I think you should discuss it with the front-end guys about solving it there. I've done enough front-end work to know it's going to be a pain to implement but with this level of user input needed, it seems to me this is the correct place to do this.
Thanks @kaliif, glad to hear that the b/e work seems achievable on a timescale of days.
Regarding whether this should be a f/e feature instead, we did think about this, but as you say it is a significant amount of work to essentially rebuild a spreadsheet software in Fragalysis. This CSV I/O feature would be for power users who are either already quite meticulous with their spreadsheets (e.g. Jasmin & Ryan) and the functionality will be very useful for computational folk (like myself) who will add tags algorithmically. The default advice for users will be to use the frontend and carefully modify individual observations.
So please go ahead with the spec
@mwinokan I've been working on this ticket for a couple of days now and some questions have risen:
The aliases for a given XCA tag upload name should be consistent across the file
I don't understand what this means
Each values of the pose column should match an observation shortcode in the data
Is this just for validation or does this mean the upload should also trigger pose operations (create, delete, reassign)?
Track timestamp and fed ID of any modifications in the database
What does this mean for deleting a tag? Is it necessary to log this operation? If yes, what information about the tag needs to be retained?
Thanks @kaliif
You mentioned several times in the spec that the user should not be allowed to change the first 12 columns.
I think it's fine to ignore those columns for now. As long as the short / long observation code can still be used to match to the relevant DB entries.
The aliases for a given XCA tag upload name should be consistent across the file
By this I meant that if a user changes the alias for an XCA generated tag, i.e. 3 - His71
to 3 - Allosteric
then that should be consistent across the whole document. I.e. 3 - His71
should not appear anywhere else, and nor should another alias for Canonical Site 3. Please also check for the correct syntax: X - YYYYY
Each values of the pose column should match an observation shortcode in the data
In addition to validating any values in the uploaded Pose column to be referencing an existing observation, please also update the database to reflect any changes to poses such as creation, deletion, reassignment.
Track timestamp and fed ID of any modifications in the database
With regards to deleting tags, I am not sure what the precise desire for the spec is (@phraenquex please comment if you have a moment). I would suggest logging the deletions with (FedID, Timestamp, Tag Type, Tag Name, Observation)
but if this requires significant development we can defer until Frank confirms.
By this I meant that if a user changes the alias for an XCA generated tag, i.e. 3 - His71 to 3 - Allosteric then that should be consistent across the whole document. I.e. 3 - His71 should not appear anywhere else, and nor should another alias for Canonical Site 3. Please also check for the correct syntax: X - YYYYY
@mwinokan since the uploaded file is discarded, the concept of 'consistency across the document' is meaningless. What do I need to look for here? Do you mean the tag names (both auto-generated and aliases) should be unique across a target?
With regards to deleting tags, I am not sure what the precise desire for the spec is (@phraenquex please comment if you have a moment). I would suggest logging the deletions with (FedID, Timestamp, Tag Type, Tag Name, Observation) but if this requires significant development we can defer until Frank confirms.
Is it just tags or do I need to track pose operations as well?
In addition to validating any values in the uploaded Pose column to be referencing an existing observation, please also update the database to reflect any changes to poses such as creation, deletion, reassignment.
This may require some changes to the file. First, the file doesn't specify the main observation for a pose. This may require an additional column(s) or I'll have to assign them randomly. Then, the validation rules (each observation must belong to a pose and each pose must have at least one observation) are difficult to follow in bulk operations. The file has to be exactly correct, if there's a conflict I'll again have to use some randomness (if it is solvable at all) which may not be what the user wants.
@kaliif
Here I mean that in the CanonSites
alias column there cannot be 3 - His71
and 3 - Allosteric
present. Each CanonSite
indicated by its number, should only have one alias.
I would suggest we postpone the tracking of these until we have confirmed a spec with @phraenquex in a new ticket
Instead of assigning randomly, can you make the default that the observation the pose is named after becomes the main observation? In the case of a conflict in the validation, I don't think you should resort to a random resolution, I would rather you throw an error message and hopefully it is informative enough to change the CSV to pass validation.
@phraenquex says that we want to log the tag changes so that a data curator can retrieve a log of changes to tags (& poses) in the download so that unwanted changes can be undone.
@kaliif says its several days of work (due to outstanding upload work and frequent need for testing) and so should be a separate ticket. @kaliif please investigate the logging in a separate ticket so that we can prioritise
Example data
Original metadata.csv from A71EV2A staging
Reordered metadata.csv with added Pose column (from new spec above)
Example modified metadata.csv (which does not break any of the rules)
Example modified metadata.csv (does not break any rules, saved as Excel file)
This one might be useful because I have filled in the cells grey to indicate ones that are not allowed to be modified by the user, and green those that I have modified metadata_modified_ok.xlsx
Example modified metadata.csv (where I have sprinkled in bad data and saved as Excel file)
Again this one has useful cell formatting to indicate the bad data in red metadata_modified_bad.xlsx
@mwinokan do you have the corresponding data tarball to go with those example files? I've tried some A71EV2A that I have and I'm running into conflicts.
@kaliif i intended the CSV's to be compatible with the A71EV2A from June, if that doesn't work I might have to recreate the files
@kaliif will add the extra columns for short tags as part of green, but the importing of modified metadata.csv's will need to wait for new example data and be part of mint
@kaliif please find here some new example files based on A71EV2A on staging. The validation spec should still be as discussed. Please let me know if you have questions.
I have included the python script used to generate the modified data, but do note that for the Excel files I have added the cell colours manually.
Tested today with A71EV2A on staging:
Results in a 500 error:
2024-09-17T09:14:34+0000 django.request.log_response():224 ERROR # Internal Server Error: /api/metadata_upload/
Traceback (most recent call last):
File "/.venv/lib/python3.11/site-packages/django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
^^^^^^^^^^^^^^^^^^^^^
File "/.venv/lib/python3.11/site-packages/django/core/handlers/base.py", line 181, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.venv/lib/python3.11/site-packages/sentry_sdk/integrations/django/views.py", line 84, in sentry_wrapped_callback
return callback(request, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.venv/lib/python3.11/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.venv/lib/python3.11/site-packages/rest_framework/viewsets.py", line 125, in view
return self.dispatch(request, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 509, in dispatch
response = self.handle_exception(exc)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 469, in handle_exception
self.raise_uncaught_exception(exc)
File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
raise exc
File "/.venv/lib/python3.11/site-packages/rest_framework/views.py", line 506, in dispatch
response = handler(request, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/code/viewer/views.py", line 2582, in create
errors = load_tags_from_file(filename=filename, target=target)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/code/viewer/tags.py", line 335, in load_tags_from_file
if so.pose.display_name != pose_name:
^^^^^^^^^^^^^^^^^^^^
@kaliif is investigating
@kaliif please try to catch this exception and show a more useful error in the logs/server response
@mwinokan to see which targets are blocking a full staging database wipe which would hopefully resolve this error
@kaliif says that a fix has been added for the A71EV2A error. @mwinokan to test
The functionality on staging is working. Thanks @kaliif. I will begin to push this to production along with #1501
Need a table/spreadsheet modal to allow rigorous tag assignment.
Include "create tag" widget from Tag details panel.