rhfogh / mxlims_data_model

Data model / API for crystallography LIMS data
GNU Lesser General Public License v2.1

Entity and attribute names and formats for sample and diffraction plan shipment/upload #4

Open KarlLevik opened 9 months ago

KarlLevik commented 9 months ago

As a starting point, below is documentation for the CSV format we currently use for this at Diamond.

I imagine we would want to agree on a standard for attribute names as well as a JSON format to replace this.

These are the CSV column names:

oscillationRange,proteinAcronym,proteinName,spaceGroup,sampleBarcode,sampleName,samplePosition,sampleComments,
cell_a,cell_b,cell_c,cell_alpha,cell_beta,cell_gamma,subLocation,loopType,requiredResolution,centringMethod,experimentKind,
radiationSensitivity,energy,userPath,screenAndCollectRecipe,screenAndCollectNValue,sampleGroup

In our actual CSV files, the first line is a header which "dynamically" defines which columns you have and their ordering. So, you can have different columns and ordering for each file, just as long as the column names are ones we know about, and you have included the mandatory columns.

Here is an example - only the first three lines of data - and note that empty columns are ignored:

#proposalCode,proposalNumber,visitNumber,shippingName,dewarCode,containerCode,preObsResolution,neededResolution,oscillationRange,proteinAcronym,proteinName,spaceGroup,sampleBarcode,sampleName,samplePosition,sampleComments,cell_a,cell_b,cell_c,cell_alpha,cell_beta,cell_gamma,subLocation,loopType,requiredResolution,centringMethod,experimentKind,radiationSensitivity,energy,userPath,screenAndCollectRecipe,screenAndCollectNValue,sampleGroup
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,,,,GPP91,GPP91,,,GPP91-2059-263C10A,1,,,,,,,,,Litho Loop,,,,,,,,,
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,,,,GPP91,GPP91,,,GPP91-2059-263C10B,2,,,,,,,,,Litho Loop,,,,,,,,,
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,,,,GPP91,GPP91,,,GPP91-2059-263C10C,3,,,,,,,,,Litho Loop,,,,,,,,,
...
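A minimal sketch of how such a file could be read, assuming only what is described above: the first line names the columns (with a leading `#`), columns can come in any order, and empty columns are ignored. The function name `read_samples`, the shortened example header, and the `mandatory` values passed in are illustrative assumptions; the real mandatory list is not reproduced here. Printing the result as JSON hints at what a JSON replacement might carry, but the layout is hypothetical.

```python
import csv
import io
import json


def read_samples(text, mandatory=()):
    """Parse CSV text whose first line is a header naming the columns."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # The example header starts with '#'; strip it from the first column name.
    header[0] = header[0].lstrip("#")
    missing = [name for name in mandatory if name not in header]
    if missing:
        raise ValueError(
            "User-defined field list is missing the following mandatory fields: %s"
            % ", ".join(missing)
        )
    samples = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pair each value with its column name; empty columns are ignored.
        samples.append({k: v for k, v in zip(header, row) if v.strip()})
    return samples


# Shortened subset of the columns shown above, for illustration only.
example = """#proposalCode,proposalNumber,visitNumber,sampleName,samplePosition,loopType,energy
mx,32101,21,GPP91-2059-263C10A,1,Litho Loop,
mx,32101,21,GPP91-2059-263C10B,2,Litho Loop,
"""

# The mandatory names here are hypothetical placeholders.
samples = read_samples(example, mandatory=("proposalCode", "sampleName"))
print(json.dumps(samples, indent=2))
```

Because each row becomes a dictionary keyed by column name, the column order in the file does not matter to anything downstream.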

I assume many of the attribute/column names are familiar and self-explanatory, but here is some extra info:

The following fields are mandatory:

Additionally, you can specify flags when you upload the file:

Validation

If not successful, the uploader will abort with an error message. If there was a minor problem, then it will complete but with a warning message.

The warning messages are:

Unable to calculate unit cell volume for sample %s with cell params %s.
Unit cell volume must be positive. Got %s for sample %s with cell params %s
Not setting lab contacts for shipment as the csv file owner %s is not a lab contact for proposal %s.
The csv file owner %s is not in the ISPyB database.
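The unit cell messages can be illustrated with the standard volume formula for a general (triclinic) cell, V = abc * sqrt(1 - cos²α - cos²β - cos²γ + 2 cosα cosβ cosγ). The sketch below is an assumption about how such a check might look, not the uploader's actual code; the names `cell_volume` and `check_cell` are hypothetical, and the message texts are paraphrased.

```python
import math

CELL_KEYS = ("cell_a", "cell_b", "cell_c", "cell_alpha", "cell_beta", "cell_gamma")


def cell_volume(a, b, c, alpha, beta, gamma):
    """Return the cell volume (angles in degrees), or None if it cannot be calculated."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    radicand = 1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg
    if radicand < 0:
        return None  # the three angles cannot form a real cell
    return a * b * c * math.sqrt(radicand)


def check_cell(sample):
    """Return (errors, warnings) for one sample dict; absent keys mean 'not set'."""
    values = [sample.get(k) for k in CELL_KEYS]
    if all(v is None for v in values):
        return [], []
    if any(v is None for v in values):
        return ["if any unit cell parameter is defined, all must be defined"], []
    a, b, c, alpha, beta, gamma = (float(v) for v in values)
    if max(alpha, beta, gamma) >= 180:
        return ["all unit cell angles must be < 180 degrees"], []
    volume = cell_volume(a, b, c, alpha, beta, gamma)
    if volume is None:
        return [], ["unable to calculate unit cell volume"]
    if volume <= 0:
        return [], ["unit cell volume must be positive"]
    return [], []
```

Returning errors and warnings separately mirrors the distinction above: an error aborts the upload, while a warning lets it complete.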

The error messages are:

client is required.

inputcsvfile is required.

file %s not found.

The csv file owner %s is not in the ISPyB database.

If either of the unit cell parameters are defined, then all must be defined. Got %s for sample %s

All unit cell angles must be < 180 degrees. Got %s for sample %s

User-defined field list is missing the following mandatory fields: %s

If uploading the csv file from a visit dir, then the visit's proposal (%s) must match that given in the file (%s).

Authorisation failure - the time delta is too large.

The csv file owner %s is not a member of any sessions/visits in the ISPyB database.

If not uploading the csv file from a visit dir, then you must be a member of a session on the proposal you're trying to upload to (%s).

Illegal characters in sampleGroup %s. Legal characters: alpha-numeric, hyphen and underscore.

The sample group ID %d does not exist

The proposalId of sample group ID %d is different from the proposalId of sample %s

There is already a sample group for proposal %s with name %s

screenAndCollectNValue is not an integer - problem with sampleName %s

screenAndCollectRecipe 'none' requires a value for requiredResolution - sampleName %s

For screenAndCollectRecipe 'best' the screenAndCollectNValue must be from 1 to 5 - problem with sampleName %s

For screenAndCollectRecipe 'best' a sampleGroup is required - problem with sampleName %s

screenAndCollectRecipe 'all' requires a value for neededResolution - problem with sampleName %s

'%s' not a valid screenAndCollectRecipe - problem with sampleName %s

Mandatory field %s not filled in. (Only mandatory for first row.) Required format is: %s

Mandatory field %s not filled in. Required format is: %s

Field %s must be max 45 characters long, this value is longer: %s

Illegal characters in sampleName %s. Legal characters: alpha-numeric, hyphen and underscore.

Space group must be at least 2 characters long or be a positive integer: %s

Space group number must be in the range [1, 230]: %s

The dewar code %s is not a registered facility code for proposal %s

The container code %s is not a registered container code

The userPath can be max 100 characters long, this one is longer: %s

The proteins must have been approved - this one isn't: acronym: %s

The proteins must already exist in ISPyB - this one doesn't: acronym: %s

Sample with name %s already exists for protein with acronym %s in this proposal.

Value required for experimentKind when UDC/queueContainer option specified. No value found for sampleName %s

Sample %s in container %s is in an invalid location %s. Valid locations are 1 to 16.

Sample %s in container %s has an invalid non-integer location %s

Sample %s in container %s is in an invalid sub-location %s. Valid locations are 0 to 7.

Sample %s in container %s has location %s, sub-location %s which is already taken.

Project %s does not exist

There are %d occurrences of sample with name %s and protein acronym %s in this CSV file.
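Several of the per-sample rules implied by these messages can be sketched as below. This is a hedged illustration only: the function names are hypothetical, the message texts are paraphrased, each check runs only when the column is present, and treating 'none', 'best' and 'all' as the only valid recipes is an assumption drawn from the messages rather than a confirmed list.

```python
import re

NAME_RE = re.compile(r"^[A-Za-z0-9_-]+$")  # alpha-numeric, hyphen and underscore


def as_int(text):
    """Return text as an int, or None if it is missing or not an integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def check_recipe(s):
    """Check the screenAndCollect* columns of one sample dict."""
    recipe = s.get("screenAndCollectRecipe")
    if not recipe:
        return []
    problems = []
    n_text = s.get("screenAndCollectNValue")
    n = as_int(n_text)
    if n_text and n is None:
        problems.append("screenAndCollectNValue is not an integer")
    if recipe == "none":
        if not s.get("requiredResolution"):
            problems.append("recipe 'none' requires requiredResolution")
    elif recipe == "best":
        if n is None or not 1 <= n <= 5:
            problems.append("recipe 'best' needs screenAndCollectNValue from 1 to 5")
        if not s.get("sampleGroup"):
            problems.append("recipe 'best' requires a sampleGroup")
    elif recipe == "all":
        if not s.get("neededResolution"):
            problems.append("recipe 'all' requires neededResolution")
    else:
        problems.append("'%s' not a valid screenAndCollectRecipe" % recipe)
    return problems


def check_sample(s):
    """Return a list of problems for one sample dict (column name -> string)."""
    problems = []
    name = s.get("sampleName")
    if name is not None and not NAME_RE.match(name):
        problems.append("illegal characters in sampleName")
    sg = s.get("spaceGroup")
    if sg:
        number = as_int(sg)
        if number is not None:
            if not 1 <= number <= 230:
                problems.append("space group number must be in the range [1, 230]")
        elif len(sg) < 2:
            problems.append("space group must be at least 2 characters or a positive integer")
    pos = s.get("samplePosition")
    if pos is not None:
        number = as_int(pos)
        if number is None:
            problems.append("non-integer location")
        elif not 1 <= number <= 16:
            problems.append("invalid location, valid locations are 1 to 16")
    sub = s.get("subLocation")
    if sub is not None:
        number = as_int(sub)
        if number is None or not 0 <= number <= 7:
            problems.append("invalid sub-location, valid locations are 0 to 7")
    return problems + check_recipe(s)
```

Checks that need the database (registered dewar codes, approved proteins, existing sample groups) are left out, since they cannot be expressed without the ISPyB lookups.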

katesmith280 commented 9 months ago

Thanks Karl for your very comprehensive starting point!

Prior to the SLS darktime, this is what our users could provide before their experiment (by email): V6_TELLSamplesSpreadsheetTemplate.xlsx

Our website heidi.psi.ch allowed users to validate their spreadsheets prior to emailing them to us. Our desktop sample changer GUI would also run the same sample import validation when the spreadsheet is uploaded prior to an experiment.

Pydantic model: https://github.com/HeidiProject/backend/blob/main/app/sample_models.py

Sample importer module: https://github.com/HeidiProject/backend/blob/main/app/sample_importer.py

ejd53 commented 9 months ago

What I like about both of these is that the column names appear to be scientist-friendly and completely decoupled from those in the database :)

Here's some JSON Schema for a previous attempt at a one-shot shipment submission, intended to encompass both pin and plate shipments as well as retrieval of crystal coordinates when putting a plate onto a home source: https://icebear.fi/shiplink/v0_3_0/schema.json

(Karl, you might remember this one, back in the day...)

A more human-friendly representation is here: https://icebear.fi/shiplink/schemadoc/?schema=https://icebear.fi/shiplink/v0_3_0/schema.json

Some of this doesn't make any sense to me after not having seen it for a few years, and there's some stuff missing, but nothing fundamentally wrong with it as far as I can see.

antolinos commented 8 months ago

Hi,

Our column names are pretty similar to what Karl has described with some minor differences. The csv can be downloaded from here

Parameters

| Parameter | Description |
|---|---|
| parcel name | |
| container name | |
| container name | |
| container type | |
| container position | |
| protein acronym | |
| sample acronym | |
| barcode | pin barcode |
| SPG | |
| cellA | |
| cellB | |
| cellC | |
| cellAlpha | |
| cellBeta | |
| cellGamma | |
| experimentType | this is the name of the workflow: MXPress-A, etc... |
| aimed Resolution | |
| required Resolution | |
| beam diameter | |
| number of positions | |
| aimed multiplicity | |
| aimed completeness | |
| forced SPG | |
| radiation sensitivity | |
| smiles | |
| total rot. angle | |
| min osc. angle | |
| observed resolution | |
| comments | |

Currently, we are adding more parameters from online data analysis, but it is still in a very immature state.

hormiai76 commented 8 months ago

Hi, at MAXIV we added 4 more columns to the ESRF ones. We need them to manage the unattended data collections:

example_MAXIV.csv

We are working on a new template in Excel that applies some restrictions to the diffraction plan columns. The user will then need to export the file as CSV and import it into py-ispyb-ui or exi.

CV-GPhL commented 8 months ago

Maybe too early, but here are a few comments about some of those items, mainly to show the kind of connection one could make between some of the items here and a dictionary like PDBx/mmCIF (the definitions there are also not perfect in some places, but it seems to be the best we have, and it is actively developed and maintained).

"aimed Resolution" and "required Resolution":

"aimed multiplicity":

"aimed completeness":

Nothing mentioned above has any impact right now - apart from maybe a renaming of "Resolution" ;-)

antolinos commented 8 months ago

Hi @CV-GPhL

I remember discussing 'aimed resolution' and 'required resolution' for quite a long time in a recent meeting. The word 'desired' was also mentioned.

I have no say about this. My opinion, at this stage of the project, is that we should encourage more scientists to participate in the discussions. I've tried to involve some at the ESRF with little (or zero) success.

Is there a style guide about upper/lower-casing ("Resolution" vs "multiplicity")?

At least in my case, I have just copied and pasted what we have in the CSV example template. It is only for listing purposes. This should not be considered as the final name that will be used to define the metadata in the catalog, where I presume each implementation will have its own styles.

CV-GPhL commented 8 months ago

@antolinos,

As I said, this kind of discussion is maybe a bit too early (and others might join in at later stages). What is important is that a discussion about the "proper" (whatever that means) scientific definition of various categories has to happen before anything goes into production. At the moment we shouldn't really care what a box is called - it's just a name after all with only a very rough meaning.