sanger / sequencescape

Web based LIMS
MIT License

DPL-211 As a Compliance Manager (Catherine M) I would like the country of origin added to the sample metadata submitted for accessioning so that we comply with the Nagoya protocol [due 15/05/23] #3470

Closed TWJW-SANGER closed 1 year ago

TWJW-SANGER commented 2 years ago

User story As a Compliance Manager (Catherine M) I would like the country of origin [and date of collection] added to the sample metadata submitted for accessioning so that we comply with the Nagoya protocol before 15/05/2023

UPDATE Initial batch of work has standardised data collection in the manifests, and improved the interface in the rarely used front-end. We do not currently have any additional validation, nor do we send the information to the EBI. These steps should follow.

Who are the primary contacts for this story Liz C Catherine M Tom W

Contact Liz C to arrange UAT testing by SSRs

Acceptance criteria To be considered successful the solution must allow:

Dependencies

References See Confluence for details of Accessioning in SequenceScape

Additional context After the end of May 2023 the ENA database will enforce country of origin for all samples submitted. This is to comply with the Nagoya Protocol, which is an agreement aiming to share the benefits arising from the use of genetic resources in a fair and equitable way.

ENA programmatic submission documentation Current default sample checklist Latest Details

JamesGlover commented 2 years ago

Validation and collection

Country of origin

Date of collection

JamesGlover commented 2 years ago

Nagoya protocol: https://www.cbd.int/abs/doc/protocol/nagoya-protocol-en.pdf

JamesGlover commented 2 years ago

Data Integrity

Country of origin

6563426 samples have no (NULL) data in the field

There are 471 different values. While a lot of these appear to be valid countries, we also have: 1) Nationalities and ethnicities

Very few of them are in block capitals.

I didn't see any fields that may be inadvertently exposing personal data, however it is possible that some of the geographic regions may be small enough that, combined with other data, an individual would be identifiable. The chances of this occurring increase with the invalid smaller regions, but could potentially be true for countries.

Date of collection

6856299 samples have no (NULL) data in the field

1) 22661 samples have 0 in the field
2) 2 are blank
3) 20 are 00/01
4) 9 are ?
5) 1 is #NA
6) 1 is ?2010
7) Some incomplete or ambiguous dates like 01-Aug. This is likely the result of Excel auto-typing '01/08', which it 'helpfully' converts to '01-Aug' when in actuality this was probably intended to refer to Jan 2008, given the requirements in the manifest. I'm a bit concerned that the same may apply to dates in the format 01-Feb-21 as well.
8) Invalid dates like: 0117-11-10
9) I'm a little suspicious of some of the dates I'm spotting. We have 1888 and 1891, and then regular dates from 1900 onwards. Given some of these appear more than once, I assume they may represent historical samples. However, we also have lots of samples for 2023-44359, which definitely indicates issues with Excel auto-fill (especially as many of these follow ranges).
10) Very broad dates like 2013

The remainder are mostly what appear to be legitimate dates, but most are not in the format suggested in the manifest. In many ways this is a good thing, as it means tightening up our collection and storage of this information will improve reportability, rather than just breaking existing reports. A few actually have timestamps, and while most of these are midnight, a few are more precise.
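To make the audit above repeatable, the buckets could be expressed as a small classifier. This is only a sketch: the method name, the category symbols and the patterns are my own assumptions derived from the examples listed in this comment, not anything that exists in Sequencescape.

```ruby
# Hypothetical classifier for legacy date-of-collection values.
# Placeholder values and patterns are taken from the examples in the audit;
# the category names are illustrative only.
PLACEHOLDERS = ['0', '', '?', '#NA', '00/01'].freeze

def classify_collection_date(raw)
  return :null if raw.nil?

  value = raw.strip
  return :placeholder if PLACEHOLDERS.include?(value)

  case value
  when /\A\d{2}-[A-Z][a-z]{2}\z/        then :excel_day_month    # e.g. 01-Aug
  when /\A\d{2}-[A-Z][a-z]{2}-\d{2}\z/  then :suspect_excel_date # e.g. 01-Feb-21
  when /\A(?:19|20)\d{2}\z/             then :year_only          # e.g. 2013
  when /\A(?:19|20)\d{2}-\d{2}-\d{2}\z/ then :iso_like           # shape only, not checked against the calendar
  else :needs_review                                             # e.g. 0117-11-10, ?2010
  end
end

# Count how many values fall into each bucket.
def summarise_collection_dates(values)
  values.each_with_object(Hash.new(0)) do |value, counts|
    counts[classify_collection_date(value)] += 1
  end
end
```

Anything landing in :needs_review, :excel_day_month or :suspect_excel_date would still need a human decision; the sketch only groups values, it does not correct them.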

JamesGlover commented 2 years ago

Exposure

Country of origin

Internal

External

Date of collection

Internal

JamesGlover commented 2 years ago

Exceptions

The EBI provide scope for exceptions https://www.ebi.ac.uk/about/news/technology-and-innovation/ena-new-metadata

Although the spatio-temporal information will become mandatory in most cases, some exceptions will be allowed when it is deemed necessary and the exception indicated to users.

JamesGlover commented 2 years ago

ENA Requirements

The current default requirements are available here: https://www.ebi.ac.uk/ena/browser/view/ERC000011 No fields, including geographic data, are currently flagged as required on the base checklist.

Country of origin

From the current list the most applicable fields would appear to be geographic location (country and/or sea) which has the following help text:

The geographical origin of the sample as defined by the country or sea. Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html).

However currently the options dropdown also contains some non-country options:

This list is accessible programmatically via: https://www.ebi.ac.uk/ena/browser/api/xml/ERC000011

The INSDC country list was last updated 'October 31, 2014' so it doesn't seem to be particularly volatile.

Some of the other checklists have stricter requirements. For example, in the Tree of Life checklist (https://www.ebi.ac.uk/ena/browser/view/ERC000053) the field is already flagged as required. (The options list appears to be the same though; I'm not sure if this is true for ALL lists.)

Note, the sample XML schema definition doesn't validate individual attributes.

Collection Date

The linked document also mentions a requirement for 'collection date'

This maps to the field: collection_date which has the following help-text:

date the specimen was collected

Validated by a regex:

(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)|(^not collected$)|(^not provided$)|(^restricted access$)

This is accessible programmatically via: https://www.ebi.ac.uk/ena/browser/api/xml/ERC000011
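As a rough illustration of applying that regex from Ruby, something like the following could work. The pattern is copied from the checklist text quoted above; the constant and method names are hypothetical. One Ruby-specific wrinkle: `^` and `$` match at line boundaries rather than string boundaries, so the sketch wraps the pattern in `\A ... \z` to stop multi-line strings slipping through.

```ruby
# Pattern copied verbatim from the ERC000011 collection date help text,
# split across lines only for readability.
ENA_COLLECTION_DATE_SOURCE =
  '(^[12][0-9]{3}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01])' \
  '(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?' \
  '(/[0-9]{4}(-[0-9]{2}(-[0-9]{2}' \
  '(T[0-9]{2}:[0-9]{2}(:[0-9]{2})?Z?([+-][0-9]{1,2})?)?)?)?)?$)' \
  '|(^not collected$)|(^not provided$)|(^restricted access$)'

# Anchor to the whole string, since Ruby's ^ and $ are per-line anchors.
ENA_COLLECTION_DATE = /\A(?:#{ENA_COLLECTION_DATE_SOURCE})\z/

# Hypothetical helper: true when the value would pass the checklist regex.
def ena_collection_date?(value)
  value.is_a?(String) && ENA_COLLECTION_DATE.match?(value)
end
```

Note that the regex only checks the shape of the value: it accepts a bare year or year and month, but it would also accept a day such as 31 in a 30-day month.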

EGA requirements

I haven't been able to find any guidelines regarding whether the EGA will be affected by these changes. I have reached out to helpdesk for comment.

The EGA have responded:

Thank you for contacting the EGA helpdesk team. At this time, we do not plan to require these fields for metadata submission. We are actively working to improve the metadata on EGA and this review may happen in Q3 of the year. However, at this time it is difficult to estimate if this will be implemented.

JamesGlover commented 2 years ago

Synopsis of progress

Plan of action

Post requirements

JamesGlover commented 2 years ago

Draft RFC

RFC: Proposed changes to multi-lims warehouse sample table

Feedback can be contributed via the github discussion [Link] or directly via email.

In order to improve the value of the data stored within the ENA, and to meet commitments of the Nagoya protocol [1], the EBI will soon be requiring spatio-temporal information for all submitted samples [2]. We currently anticipate that this will cover the 'country_of_origin' and 'date_of_sample_collection' fields as collected in Sequencescape and presented in the multi-lims warehouse. Neither field is currently sent to the ENA or EGA.

As part of an initial investigation into supporting these requirements we've looked at the validation, persistence and data-integrity of the existing data, and as a result we anticipate making some changes to the multi-lims warehouse. We hope that ultimately these will improve the quality of the persisted data, however they will result in schema changes and some differences in data.

country_of_origin

This is currently a free-text field in Sequencescape, however the requirements in the EBI[3] indicate a controlled vocabulary. This list is based on the INSDC country list, although it currently also supports non-country meta-entities, such as 'not collected' and 'restricted access'.

A brief analysis of data integrity revealed that this field is currently mainly unpopulated. However it also contains several entries that will cease to be valid with the new restrictions. Examples include clearly invalid data such as numbers, non-country geographical regions such as 'Africa' or 'Forrest of dean', synonyms such as 'UK', and spelling errors. There are also a large number of cases of the field being used to store nationality or ethnic background.

There are also cases where it appears that the field has been repurposed to track other non-geographic information, such as containing RNA and IBS, neither of which appear to be valid three letter country codes.

In future we hope this column will match the controlled vocabulary used by the EBI. This change will obviously result in historical data changing, but should hopefully improve the quality of downstream reporting. In cases where it is not possible to unambiguously match data to a valid field, we hope to consult with the original owners of the sample metadata to provide corrected values. However we expect that it will not be possible in all situations, and in these cases the field will be populated with NULL.

NULL will be used to represent any fields when country_of_origin has not been specified. We welcome any discussion on whether 'not provided', part of the current EBI controlled vocabulary, would be more appropriate.

date_of_collection

This is also currently a free text field in Sequencescape and the multi-lims warehouse. The EBI requirements[2] specify that in future they will require 'The collection date of the sample, recording at least the year of collection.' Currently this data is validated by a regular expression [3].

In future we hope to convert this column to a DATETIME field. We hope that this greatly simplifies any reporting using this field. We've opted for DATETIME over date as some of our existing data has non-midnight timestamps attached, and the EBI supports higher resolution timestamps.

Currently this column is largely unpopulated. However along with obviously invalid data (#N/A, 0) the column contains a range of dates in a variety of formats. Unfortunately it also appears that Excel may have resulted in two data integrity issues.

We see several dates in the format '01-Aug', which initially appear to be ambiguous. However if a date is supplied in the MM/YY format the manifest suggests, then Excel converts 01/08 (January 2008) to 01/08/current_year which gets displayed as '01-Aug'. I have some concerns that dates in the format '02-Dec-19' may also be a side effect of this 'helpful' feature.

There is also reason to suspect that some years provided are invalid, as we have collection dates in the future. Given these often follow on consecutively, I suspect this is a side effect of Excel's auto-fill feature.

We hope to migrate all unambiguous dates to the date-time columns, and will work with data owners to try to update any dates which are ambiguous, or may have fallen foul of Excel's data-conversion. Any dates that can't be unambiguously migrated, or which were absent, will have a value of NULL.

legacy_data

We are keen to receive feedback on whether anyone feels the need to maintain legacy data, and are happy to work out the best ways to achieve this. Where possible it is likely we'll be able to migrate data to other columns (such as 'geographic_region') but we are willing to consider moving data to explicitly 'legacy' columns if absolutely necessary.

References
[1] Nagoya Protocol: https://www.cbd.int/abs/
[2] EBI notification: https://www.ebi.ac.uk/about/news/press-releases/ena-new-metadata
[3] EBI Default sample checklist: https://www.ebi.ac.uk/ena/browser/view/ERC000011
[4] INSDC country list: https://www.insdc.org/country.html

JamesGlover commented 2 years ago

INSDC Missing Value Reporting Terms

| INSDC term (top level) | INSDC term (lower level) | Definition |
| -- | -- | -- |
| not applicable | | information is inappropriate to report, can indicate that the standard itself fails to model or represent the information appropriately |
| missing | not collected | information of an expected format was not given because it has not been collected |
| | not provided | information of an expected format was not given, a value may be given at the later stage |
| | restricted access | information exists but can not be released openly because of privacy concerns |

Source: https://ena-docs.readthedocs.io/en/latest/submit/samples/missing-values.html#insdc-missing-value-reporting-terms

JamesGlover commented 2 years ago

Now I've got the full list pulled down I've found 309 different values which cannot be mapped back to countries from the valid list. Before touching any of the data, including the fairly safe corrections like 'UK -> United Kingdom', I'd like to get some of the initial changes out.

I think I'd like to provide a tool to assist with some of the safer, simple corrections.
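A minimal sketch of what the 'safe correction' part of such a tool might look like is below. Everything here is hypothetical: the vocabulary is a tiny illustrative subset (the real list would come from the EBI checklist), the synonym table holds only the one example mentioned above, and the method name is made up.

```ruby
# Illustrative subset only; the real controlled vocabulary would be loaded
# from the EBI checklist rather than hard-coded.
VALID_COUNTRIES = [
  'United Kingdom', 'France', 'Kenya',
  'not collected', 'not provided', 'restricted access'
].freeze

# Corrections considered unambiguous enough to suggest automatically.
SAFE_CORRECTIONS = { 'uk' => 'United Kingdom' }.freeze

# Returns the controlled-vocabulary term for a free-text value, or nil when
# no safe suggestion exists and the value needs manual curation.
def suggest_country(raw)
  return nil if raw.nil?

  cleaned = raw.strip
  exact = VALID_COUNTRIES.find { |country| country.casecmp?(cleaned) }
  return exact if exact

  SAFE_CORRECTIONS[cleaned.downcase]
end
```

Returning nil rather than guessing keeps regions like 'Africa' or repurposed values like 'RNA' out of the automatic path, so they can be referred back to the data owners.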

JamesGlover commented 2 years ago

Having a bit of trouble handling dates:

1) Excel is a bit of a pain, and even setting a column to a date-type allows nonsense input
2) You can work around this to some extent with validation, such as ensuring a date is earlier than some point in the future
3) But we have the difficulty that we want to support low-precision dates, such as just a year, or a year and a month

And the latter causes issues when reaching Ruby, as the ruby date library doesn't allow non-existing dates. (MySQL does, with the right permissions attached)

I'm leaning towards 'YYYY-MM-DD', but probably as a text field still to allow arbitrary precision.
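One way the text-field approach might be handled on the Ruby side is to parse the string into its components and only ask the date library about the parts that were supplied. This is a sketch under my own assumptions: `PartialDate`, `PARTIAL_DATE` and `parse_partial_date` are hypothetical names, and `Date.valid_date?` is the standard library check for whether a calendar date exists.

```ruby
require 'date'

# Holds whichever components were supplied; month and day may be nil.
PartialDate = Struct.new(:year, :month, :day) do
  def precision
    return :day if day

    month ? :month : :year
  end
end

# Accepts YYYY, YYYY-MM or YYYY-MM-DD.
PARTIAL_DATE = /\A(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?\z/

def parse_partial_date(text)
  match = PARTIAL_DATE.match(text.to_s.strip)
  return nil unless match

  year, month, day = match.captures.map { |part| part&.to_i }
  # Missing components are filled with 1 purely so the supplied ones can be
  # checked against the calendar; they are not stored.
  return nil unless Date.valid_date?(year, month || 1, day || 1)

  PartialDate.new(year, month, day)
end
```

For example, `parse_partial_date('2008-01').precision` gives `:month`, while `parse_partial_date('2021-02-30')` gives `nil` because that day does not exist.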

JamesGlover commented 2 years ago

Checking with the EBI if they mind us redistributing the XML, as it would simplify the process and reduce our load on their systems.

SujitDey2022 commented 1 year ago

Need to identify items for new user story and then can be closed and moved to Done.

TWJW-SANGER commented 1 year ago
emrojo commented 1 year ago

List of tasks Divided in 2 stories:

First stage Strict solution to make it work only with right data, and all wrong historic data will have a default NULL value for these fields:

  • [x] Add all this part inside a feature flag
  • [x] Add country_of_origin and collection date to list of tags for ENA for sample (add in app/models/sample.rb a line include_tag(:country_of_origin) and same for collection date)
  • [x] In app/models/accessionable/base.rb class Tag change the label name to use the field names: geographic location (country and/or sea) and collection_date when generating the XML.
  • [ ] In app/models/accessionable/base.rb class Tag add validation so we send null values for country of origin and collection date if it does not match the required regular expressions/list of values.
  • [ ] Update manifest recommendations to match upcoming requirements
  • [ ] Communicate change in requirements with SSRs
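For illustration, the label mapping described in the Tag step might produce XML along these lines. This is a standalone sketch, not the code in app/models/accessionable/base.rb: the hash, the method names and the hand-rolled escaping are all assumptions, and only the two label strings come from the task list above. The SAMPLE_ATTRIBUTE / TAG / VALUE element names follow the ENA sample XML format.

```ruby
# Field name => label to emit in the TAG element (labels from the task list).
ENA_TAG_LABELS = {
  country_of_origin: 'geographic location (country and/or sea)',
  date_of_sample_collection: 'collection_date'
}.freeze

# Minimal escaping for text nodes; a real implementation would more likely
# rely on an XML builder.
def xml_escape(text)
  text.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Builds one SAMPLE_ATTRIBUTE element for the given field and value.
def sample_attribute_xml(field, value)
  label = ENA_TAG_LABELS.fetch(field)
  "<SAMPLE_ATTRIBUTE><TAG>#{xml_escape(label)}</TAG>" \
    "<VALUE>#{xml_escape(value)}</VALUE></SAMPLE_ATTRIBUTE>"
end
```

Using `fetch` means an unmapped field raises a KeyError instead of silently emitting an internal column name as the tag.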

Second stage Curate all historic data:

  • [ ] Migrate existing columns to 'legacy_*' versions
  • [ ] Create new DATE column for date_of_sample_collection
  • [ ] Create new sample_metadata_countries table for country names, and add a country_of_origin_record association to sample_metadata. The association may be NULL if the sample metadata doesn't have a country.
  • [ ] Populate sample_metadata_countries with current EBI list (i.e. including fields that may be removed)
  • [ ] Test in the ENA dev testing environment and check with them that it is sending it right
  • [ ] Check if this is needed for EGA
TWJW-SANGER commented 1 year ago

Hi, Quick query. There is no business value in curating the historic data prior to this requirement. Is the second stage above related to enforcing the strict requirement in the database? And if so, is enforcing the requirements at an application level good enough that we could drop the second stage? Many thanks, Tom

LizCook-ec20 commented 1 year ago

As discussed with @SujitDey2022 , could there be an addition to this story, whereby the mandatory columns in manifests are highlighted in red so it is clear to the service user which columns are mandatory?

emrojo commented 1 year ago

Post talk with Neil and Tom:

We'll send the following flag:

not provided Information of an expected format was not given, a value may be given at the later stage data agreement established pre-2023

for everything that is not after 15/May/2023 and does not match the regular expression for the field.
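The rule above could be sketched roughly as follows. Several things here are assumptions on my part: which date on the sample is compared against the cutoff (passed in as `sample_date`), the validator being supplied by the caller, and what happens to invalid values after the cutoff, which the rule does not cover (this sketch passes them through unchanged so that they fail visibly downstream).

```ruby
require 'date'

NOT_PROVIDED = 'not provided'
CUTOFF = Date.new(2023, 5, 15)

# value       - the stored metadata value (may be nil)
# sample_date - the date used to decide pre/post cutoff (assumed to be a Date)
# valid       - callable returning true when the value meets the field's rules
def value_to_send(value, sample_date, valid)
  return value if valid.call(value)
  # "Not after" the cutoff, so the cutoff day itself still gets the flag.
  return NOT_PROVIDED if sample_date <= CUTOFF

  value
end
```

A caller might pass the checklist regex for collection date, or a controlled-vocabulary lookup for country, as the `valid` argument.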

emrojo commented 1 year ago

How to test the contents of the sample published:

curl -v -X GET <testing_server_url_and_path>/<accession_number> -u "<username>:<password>"
emrojo commented 1 year ago

Some more documentation about the change happening in ENA:

https://ena-docs.readthedocs.io/en/latest/faq/spatiotemporal-metadata.html