Changes in @only1chunts' new XLSX representation of MIxS

turbomam commented 1 year ago

Note that these revert away from NCBI's structured comment names in several places, presumably to keep the SCN lengths under 20 characters.
See https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/
Where does it say that the SCN's length must be less than N characters? (Ideally the source would be a computable file, not a narrative.)
Also: shouldn't we be consistent in the tokens composed together? See the keyword notes in XXX

Incomplete salinity remodeling?

GSC 6.1: salinity_meth GSC 6.1: MIXS:0000341 GSC 6.1: salinity_meth Reference or method used in determining salinity
GSC 6.1: samp_salinity GSC 6.1: MIXS:0000109

salinity salinity The total concentration of all dissolved salts in a liquid or solid sample. While salinity can be measured by a complete chemical analysis, this method is difficult and time consuming. More often, it is instead derived from the conductivity measurement. This is known as practical salinity. These derivations compare the specific conductance of the sample to a salinity standard such as seawater. measurement value {float} {unit} 25 practical salinity unit X practical salinity unit, percentage 1 MIXS:0000183

only1chunts commented 1 year ago

I'm not entirely sure what all of this ticket means, but one part of it was about seeing the structured comment name length written down somewhere, and I can answer that bit: it's from the INSDC feature table documentation here: https://www.insdc.org/submitting-standards/feature-table/#3.1

3.1 Naming conventions

Feature table components, including feature keys, qualifiers, accession numbers, database name abbreviations, and location operators, are all named following the same conventions. Component names may be no more than 20 characters long (Feature keys 15, Feature qualifiers 20) and must contain at least one letter. The following characters are permitted to occur in feature table component names:

Uppercase letters (A-Z)

Lowercase letters (a-z)

Numbers (0-9)

Underscore (_)

Hyphen (-)

Single quotation mark or apostrophe (')

Asterisk (*)
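The quoted rules can be sketched as a small check. This is only an illustration of the section 3.1 text as quoted above (20-character default limit, at least one letter, the listed characters); the function name `is_valid_insdc_name` is hypothetical and not part of any INSDC or NCBI tooling.

```python
import re

# Characters permitted by the quoted section 3.1: letters, digits,
# underscore, hyphen, apostrophe and asterisk.
_ALLOWED = re.compile(r"[A-Za-z0-9_\-'*]+")


def is_valid_insdc_name(name: str, max_len: int = 20) -> bool:
    """Return True if `name` satisfies the quoted naming conventions.

    `max_len` defaults to 20 (feature qualifiers); pass 15 for feature keys.
    """
    return (
        len(name) <= max_len
        and _ALLOWED.fullmatch(name) is not None
        and any(c.isalpha() for c in name)  # must contain at least one letter
    )


if __name__ == "__main__":
    for candidate in ["samp_salinity", "16s_recover_software", "12345"]:
        print(candidate, is_valid_insdc_name(candidate))
```

Passing `max_len=15` applies the stricter feature-key limit mentioned in the same paragraph.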

It does appear that the BioSamples database is not following the INSDC rules, so perhaps we don't need to either, but that's for a wider discussion, as it would be a change in a fundamental policy.

turbomam commented 1 year ago

Thanks, that's clear and helpful.

For compatibility with LinkML, slot/term names should consist only of lowercase letters, numbers and underscores. One could call that lower snake case. Numbers should not appear first.
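As a rough sketch of that convention (lowercase letters, numbers and underscores, with no leading number), the check below uses a regular expression. It encodes only the rule as stated in this comment, not LinkML's own validation logic, and `is_lower_snake` is a hypothetical helper name.

```python
import re

# First character: lowercase letter or underscore; the rest may also be digits.
_LOWER_SNAKE = re.compile(r"[a-z_][a-z0-9_]*")


def is_lower_snake(name: str) -> bool:
    """Return True if `name` is lower snake case and does not start with a digit."""
    return _LOWER_SNAKE.fullmatch(name) is not None


if __name__ == "__main__":
    for slot in ["samp_salinity", "16s_recover", "HACCP_term", "ifsac_category"]:
        print(slot, is_lower_snake(slot))
```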

Using some of those additional characters will break even the most basic LinkML functionality.

Others will make it difficult to generate or understand derived artifacts, like jsonschema and RDF. This is because LinkML casts element names into the legal namespace for each format.

The terms that we have discussed in this matter the most are 16s_recover and 16s_recover_software. They do seem to work with jsonschema. I will try RDF next.

Note that, as of last month, there were no values populated into either the 16s_recover or 16s_recover_software field in any NCBI Biosample.

turbomam commented 1 year ago

I would like to review whether there are slots whose names must contain uppercase letters, like HACCP_term and IFSAC_category. If there's a record that data has already been entered into NCBI with those terms in those cases, I guess that would argue for leaving them as-is.

NCBI's attributes file already casts to lower case:

haccp_term (no values populated in NCBI BiosampleSet as of last month)
ifsac_category (125 unique values populated into ~ 2000 Biosamples)

turbomam commented 1 year ago

The names for classes (checklists, environmental packages, extensions) should be PascalCase aka UpperCamelCase. I have just forced all of those already but we should review the consequences of that. The title can always match something historical and non-PascalCase.
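For illustration, one way to derive a PascalCase class name from a historical title is sketched below. This is an assumption about how such a conversion could be done, not a description of how the names were actually forced here; `to_pascal_case` and the sample titles are hypothetical.

```python
import re


def to_pascal_case(title: str) -> str:
    """Join the alphanumeric tokens of `title`, upper-casing each token's first character.

    The rest of each token is left untouched, so existing acronyms keep their case.
    """
    tokens = re.split(r"[^A-Za-z0-9]+", title)
    return "".join(t[0].upper() + t[1:] for t in tokens if t)


if __name__ == "__main__":
    # Illustrative titles only; the historical title can still be kept as-is.
    for title in ["host-associated", "microbial mat/biofilm", "MIMS"]:
        print(title, "->", to_pascal_case(title))
```

Keeping the original string in the title and using the converted form only as the class name matches the suggestion that the title can stay historical.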

only1chunts commented 1 year ago

As long as the slot label (or whatever we call the long version of the name) can handle the correctly cased text, I don't think the short names matter so much. Any validation tool we end up making should be capable of understanding differences in case.
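A minimal sketch of what case-tolerant matching could look like in such a tool, assuming a simple lookup keyed on the case-folded name. No validation tool exists in this thread, so `build_case_index` and `resolve` are hypothetical names.

```python
from typing import Dict, Iterable, Optional


def build_case_index(names: Iterable[str]) -> Dict[str, str]:
    """Map each case-folded name to its canonical spelling.

    Raises ValueError if two different names differ only by case.
    """
    index: Dict[str, str] = {}
    for name in names:
        key = name.casefold()
        if key in index and index[key] != name:
            raise ValueError(f"{name!r} and {index[key]!r} differ only by case")
        index[key] = name
    return index


def resolve(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical spelling for `name`, ignoring case, or None if unknown."""
    return index.get(name.casefold())


if __name__ == "__main__":
    index = build_case_index(["HACCP_term", "IFSAC_category"])
    print(resolve("haccp_term", index))
    print(resolve("ifsac_category", index))
```

Rejecting names that differ only by case at index-build time keeps the lookup unambiguous.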

turbomam commented 1 year ago

Thanks for the immediate feedback, @only1chunts. Aren't you on holiday?

only1chunts commented 1 year ago

I have tomorrow off, and I'm now about to log off!

microbiomedata / mixs-6-2-release-candidate

Changes in @only1chunts' new XLSX representation of MIxS #54
