loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

feat(ena-submission): Decide on (sample) metadata fields mapping #2313

Closed anna-parker closed 2 months ago

anna-parker commented 4 months ago

When we submit to ENA, they suggest using metadata fields that they recognize. This isn't strictly necessary, as we can submit with arbitrary fields, but it will make life easier for the wider community. The standard metadata template for virus data is https://www.ebi.ac.uk/ena/browser/view/ERC000033. I reviewed the fields it proposes and suggest the following mapping from the metadata fields we offer in loculus to ENA's metadata fields.

Proposed Mapping from ENA fields to loculus fields:

"subject exposure":
  - exposure_event
"type exposure":
  function:
    name: concatenate
    loculus_fields:
      - exposure_setting
      - exposure_details
"hospitalisation":
  loculus_fields:
    - environmental_site
  function:
    name: match
    args: Hospital
  type:
    bool
"illness symptoms":
  - signs_and_symptoms  
"collection date":
  - sample_collection_date
"geographic location (country and/or sea)":
  loculus_fields:
    - geo_loc_country
  type:
    options: country
"geographic location (region and locality)":
  function:
    name: concatenate
    loculus_fields:
      - geo_loc_admin_1
      - geo_loc_admin_2
"host disease outcome":
  loculus_fields:
    - host_health_outcome
  type:
    options: [dead, recovered, recovered with sequelae]
"host common name":
  - host_name_common
"host age":
  - host_age
"host health state":
  loculus_fields:
    - host_health_state
  type:
    options: [diseased, healthy, not applicable, not collected, not provided, restricted access, ...]
"host sex":
  loculus_fields:
    - host_gender
  type:
    options: [female, hermaphrodite, male, neuter, not applicable, not provided, other, ....]
lab_host:
  - is_lab_host
"host scientific name":
  - host_name_scientific
"collector name":
  - author
"collecting institution":
  function:
    name: concatenate
    loculus_fields:
      - sequenced_by_organization
      - author_affiliations
"receipt date":
  - sample_received_date
isolate:
  - specimen_collector_sample_id
"isolation source host-associated":
  function:
    name: concatenate
    loculus_fields:
      - anatomical_material
      - body_product
      - anatomical_part
"host description":
  function:
    name: concatenate
    loculus_fields:
      - host_vaccination_status
      - host_disease
"host behaviour":
  - host_role
"isolation source non-host-associated":
  function:
    name: concatenate
    loculus_fields:
      - environmental_material
      - food_product
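
The proposal names two helper functions, concatenate and match, without defining them. A minimal Python sketch of what they might do is shown here. It assumes that concatenate joins the non-empty values with "; " and that match is a case-insensitive substring test returning a bool. The separator, the matching rule and the sample values are all assumptions, not something this issue specifies:

```python
from typing import Optional, Sequence


def concatenate(values: Sequence[Optional[str]], sep: str = "; ") -> str:
    """Join the non-empty values; the '; ' separator is an assumption."""
    return sep.join(v.strip() for v in values if v and v.strip())


def match(values: Sequence[Optional[str]], pattern: str) -> bool:
    """True if any value contains the pattern (case-insensitive, assumed rule)."""
    return any(pattern.lower() in v.lower() for v in values if v)


# "type exposure" built from exposure_setting and exposure_details
# (the values here are made up for illustration).
print(concatenate(["Healthcare setting", None, "Ward visit"]))

# "hospitalisation" derived from environmental_site matched against "Hospital".
print(match(["Hospital ward"], "Hospital"))
```

Dropping empty values before joining avoids stray separators when only one of the source fields is filled.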

I map to most fields in ENA except for the following:

Suggested fields we will ignore

Open Questions:

  1. We are not filling these fields because they have no good counterpart in loculus. Should we add them to our metadata fields?

    Suggested fields we could add:

    • definition for seropositive sample (free text, recommended)
    • serotype (required for a seropositive sample) (free text, recommended)
  2. We might be able to fill certain fields with more complicated mappings. Should we try? For example, we could use the Google Maps API to get the latitude and longitude from the city name, and Nextclade should give us a clade that we could map to a strain.

    Suggested fields we might be able to fill:

    • strain (free text, optional)
    • geographic location (latitude) (restricted text, recommended, units: DD)
    • geographic location (longitude) (restricted text, recommended, units: DD)

  3. ENA has collector name and collecting institution. In my suggestion above I map these to our author and author affiliation fields. Is this ok? Should we add these fields to our metadata?

anna-parker commented 4 months ago

We should use PHA4GE's mappings:

ENA Virus Checklist Field | PHA4GE Field
subject exposure | exposure event
subject exposure duration | NULL
type exposure | exposure event
personal protective equipment | NULL
hospitalisation | specified as value under host health status
illness duration | NULL
illness symptoms | signs and symptoms
collection date | sample collection date
geographic location (country and/or sea) | geo_loc (country)
geographic location (latitude) | geo_loc latitude
geographic location (longitude) | geo_loc longitude
geographic location (region and locality) | geo_loc (state/province/region)
sample capture status | purpose of sampling
host disease outcome | host disease outcome
host common name | host (common name)
host subject id | host subject ID
host age | host age
host health state | host health state
host sex | host gender
host scientific name | host (scientific name)
virus identifier | specimen collector sample ID
collector name | NULL
collecting institution | sample collected by
receipt date | received date
sample storage conditions | NULL
definition for seropositive sample | NULL
serotype (required for a seropositive sample) | NULL
isolate | isolate
strain | NULL
host habitat | NULL
isolation source host-associated | anatomical material; anatomical part; body product
host description | NULL
gravidity | NULL
host behaviour | NULL
isolation source non-host-associated | environmental site; environmental material
anna-parker commented 4 months ago

Converted to loculus's mapping format. Fields marked with (*) required modifications because loculus has slightly different metadata fields than PHA4GE:

metadata_mapping:
  'subject exposure':
    loculus_fields: [exposure_event]
  'type exposure':
    loculus_fields: [exposure_event]
  hospitalisation:
    loculus_fields: [host_health_state]
    function: match
    args: [Hospital]
  'illness symptoms':
    loculus_fields: [signs_and_symptoms]
  'collection date':
    loculus_fields: [sample_collection_date]
  'geographic location (country and/or sea)':
    loculus_fields: [geo_loc_country]
  'geographic location (region and locality)':
    loculus_fields: [geo_loc_admin_1]
  'sample capture status':
    loculus_fields: [purpose_of_sampling]
  'host disease outcome':
    loculus_fields: [host_health_outcome]
  'host common name':
    loculus_fields: [host_name_common]
  'host age':
    loculus_fields: [host_age]
  'host health state':
    loculus_fields: [host_health_state]
  'host sex':
    loculus_fields: [host_gender]
  'host scientific name':
    loculus_fields: [host_name_scientific]
  'isolate':  # (*)
    loculus_fields: [specimen_collector_sample_id]
  'collecting institution':  # (*)
    loculus_fields: [sequenced_by_organization, author_affiliations]
  'receipt date':
    loculus_fields: [sample_received_date]
  'isolation source host-associated':
    loculus_fields: [anatomical_material, anatomical_part, body_product]
  'isolation source non-host-associated':
    loculus_fields: [environmental_site, environmental_material]
  'authors':  # (*)
    loculus_fields: [authors]
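
To illustrate how a mapping of this shape could be applied, here is a rough Python sketch that turns one loculus metadata record into ENA attribute values. It copies only three entries from the mapping. The function name to_ena_attributes, the "; " separator for multi-field entries, the "true"/"false" output of match and the choice to skip empty fields are all assumptions made for illustration, not the actual loculus implementation:

```python
from typing import Any, Dict, Optional

# Excerpt of the mapping above, written as a Python dict.
METADATA_MAPPING: Dict[str, Dict[str, Any]] = {
    "collection date": {"loculus_fields": ["sample_collection_date"]},
    "hospitalisation": {
        "loculus_fields": ["host_health_state"],
        "function": "match",
        "args": ["Hospital"],
    },
    "collecting institution": {
        "loculus_fields": ["sequenced_by_organization", "author_affiliations"],
    },
}


def to_ena_attributes(record: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Build ENA attribute values from one loculus metadata record."""
    attributes: Dict[str, str] = {}
    for ena_field, rule in METADATA_MAPPING.items():
        values = [record.get(name) for name in rule["loculus_fields"]]
        present = [v for v in values if v]
        if not present:
            continue  # assumption: omit ENA fields that have no loculus value
        if rule.get("function") == "match":
            # Assumed semantics: "true" if any value contains any argument.
            hit = any(
                arg.lower() in v.lower() for arg in rule["args"] for v in present
            )
            attributes[ena_field] = "true" if hit else "false"
        else:
            # Assumed default: join multiple loculus fields with "; ".
            attributes[ena_field] = "; ".join(present)
    return attributes
```

Keeping the mapping as data rather than code means that adding or changing an ENA field only requires editing the configuration.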
anna-parker commented 4 months ago

Note that even when we do not specify a checklist, ENA validates sample metadata against its default checklist: https://www.ebi.ac.uk/ena/browser/view/ERC000011. Thankfully, this aligns with our required fields: sample_collection_date and geo_loc_country.
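
A small pre-submission check along these lines could catch records that would fail the default checklist. This is only a sketch: the function name missing_required_fields is hypothetical, and it checks only that the two required fields are present, not their format:

```python
from typing import Dict, List, Optional

# Required loculus fields named in the comment above.
REQUIRED_FIELDS = ("sample_collection_date", "geo_loc_country")


def missing_required_fields(record: Dict[str, Optional[str]]) -> List[str]:
    """Return the required fields that are absent or blank in the record."""
    return [
        name for name in REQUIRED_FIELDS
        if not (record.get(name) or "").strip()
    ]
```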

anna-parker commented 2 months ago

Completed