Closed anna-parker closed 2 months ago
We should use PHA4GE's mappings:
| ENA Virus Checklist Field | PHA4GE Field |
| --- | --- |
| subject exposure | exposure event |
| subject exposure duration | NULL |
| type exposure | exposure event |
| personal protective equipment | NULL |
| hospitalisation | specified as value under host health status |
| illness duration | NULL |
| illness symptoms | signs and symptoms |
| collection date | sample collection date |
| geographic location (country and/or sea) | geo_loc (country) |
| geographic location (latitude) | geo_loc latitude |
| geographic location (longitude) | geo_loc longitude |
| geographic location (region and locality) | geo_loc (state/province/region) |
| sample capture status | purpose of sampling |
| host disease outcome | host disease outcome |
| host common name | host (common name) |
| host subject id | host subject ID |
| host age | host age |
| host health state | host health state |
| host sex | host gender |
| host scientific name | host (scientific name) |
| virus identifier | specimen collector sample ID |
| collector name | NULL |
| collecting institution | sample collected by |
| receipt date | received date |
| sample storage conditions | NULL |
| definition for seropositive sample | NULL |
| serotype (required for a seropositive sample) | NULL |
| isolate | isolate |
| strain | NULL |
| host habitat | NULL |
| isolation source host-associated | anatomical material; anatomical part; body product |
| host description | NULL |
| gravidity | NULL |
| host behaviour | NULL |
| isolation source non-host-associated | environmental site; environmental material |
Converted to loculus's mapping. Fields marked with (*) required modifications, as loculus has slightly different metadata fields than PHA4GE:
metadata_mapping:
  'subject exposure':
    loculus_fields: [exposure_event]
  'type exposure':
    loculus_fields: [exposure_event]
  hospitalisation:
    loculus_fields: [host_health_state]
    function: match
    args: [Hospital]
  'illness symptoms':
    loculus_fields: [signs_and_symptoms]
  'collection date':
    loculus_fields: [sample_collection_date]
  'geographic location (country and/or sea)':
    loculus_fields: [geo_loc_country]
  'geographic location (region and locality)':
    loculus_fields: [geo_loc_admin_1]
  'sample capture status':
    loculus_fields: [purpose_of_sampling]
  'host disease outcome':
    loculus_fields: [host_health_outcome]
  'host common name':
    loculus_fields: [host_name_common]
  'host age':
    loculus_fields: [host_age]
  'host health state':
    loculus_fields: [host_health_state]
  'host sex':
    loculus_fields: [host_gender]
  'host scientific name':
    loculus_fields: [host_name_scientific]
  'isolate':  # (*)
    loculus_fields: [specimen_collector_sample_id]
  'collecting institution':  # (*)
    loculus_fields: [sequenced_by_organization, author_affiliations]
  'receipt date':
    loculus_fields: [received date]
  'isolation source host-associated':
    loculus_fields: [anatomical material, anatomical part, body product]
  'isolation source non-host-associated':
    loculus_fields: [environmental site, environmental material]
  'authors':  # (*)
    loculus_fields: [authors]
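To make the intended use of this mapping concrete, here is a minimal Python sketch of how it could be applied to one loculus record. It is not the actual loculus implementation. The function name `to_ena_attributes`, the "; " separator for multi-field entries, and the `"true"`/`"false"` output of `match` are all assumptions made for illustration. The mapping is written as a dict with only a few of the entries above so the example has no YAML dependency.

```python
import re

# Subset of the mapping above, as a Python dict (hypothetical representation).
METADATA_MAPPING = {
    "collection date": {"loculus_fields": ["sample_collection_date"]},
    "geographic location (country and/or sea)": {"loculus_fields": ["geo_loc_country"]},
    "hospitalisation": {
        "loculus_fields": ["host_health_state"],
        "function": "match",
        "args": ["Hospital"],
    },
    "collecting institution": {
        "loculus_fields": ["sequenced_by_organization", "author_affiliations"],
    },
}


def to_ena_attributes(record, mapping=METADATA_MAPPING):
    """Build ENA sample attributes from one loculus metadata record."""
    attributes = {}
    for ena_field, rule in mapping.items():
        # Collect the non-empty loculus values this ENA field draws from.
        values = [str(record[f]) for f in rule["loculus_fields"] if record.get(f)]
        if not values:
            continue  # nothing to submit for this ENA field
        if rule.get("function") == "match":
            # Assumed semantics: "true" if any value matches any pattern in args.
            hit = any(
                re.search(pattern, value, re.IGNORECASE)
                for pattern in rule["args"]
                for value in values
            )
            attributes[ena_field] = "true" if hit else "false"
        else:
            # Assumed semantics: several loculus fields are joined into one value.
            attributes[ena_field] = "; ".join(values)
    return attributes
```

ENA fields whose loculus sources are all empty are simply left out, so optional checklist fields are not submitted with blank values.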
Note that even when we do not specify a checklist, ENA uses the default checklist to check sample metadata: https://www.ebi.ac.uk/ena/browser/view/ERC000011. Thankfully this aligns with our required fields, sample_collection_date and geo_loc_country.
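A pre-submission check for these two fields could look like the sketch below. The helper name `missing_required_fields` is hypothetical, and the check only tests for presence; it does not reproduce any format validation ENA itself performs.

```python
# The two loculus fields named above as required.
REQUIRED_LOCULUS_FIELDS = ("sample_collection_date", "geo_loc_country")


def missing_required_fields(record):
    """Return the required fields that are absent or blank in a loculus record."""
    return [
        field
        for field in REQUIRED_LOCULUS_FIELDS
        if not str(record.get(field) or "").strip()
    ]
```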
Completed
When we submit to ENA, they suggest we use metadata fields that they recognize. This isn't strictly necessary, as we can submit with arbitrary fields, but it will make life easier for the wider community. The standard metadata template for virus data is https://www.ebi.ac.uk/ena/browser/view/ERC000033. I reviewed the entries they propose and would suggest the following mapping from the metadata fields we offer on loculus to ENA's metadata fields.
Proposed Mapping from ENA fields to loculus fields:
I map to most fields in ENA except for the following:
Suggested fields we will ignore
Open Questions:
We are not filling these fields as they do not have a good counterpart in loculus - maybe we should add them to our metadata fields?
Suggested fields maybe we should add?:
We might be able to fill certain fields with more complicated mappings; should we try? For example, we could use the Google Maps API to get the latitude and longitude from the city name, and Nextclade should give us the clade, which we could map to strain?
Suggested fields maybe we can fill?:
strain free text optional
geographic location (latitude) restricted text recommended DD
geographic location (longitude) restricted text recommended DD
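If we did derive coordinates, they would need to be written in decimal degrees (DD). A rough sketch of that formatting step follows; the helper name `format_decimal_degrees` and the four-decimal precision are assumptions, and the exact pattern ENA accepts for these restricted-text fields should be checked against the checklist.

```python
def format_decimal_degrees(value, limit):
    """Return value as a decimal-degree string, or None if it is not a
    number within [-limit, limit] (90 for latitude, 180 for longitude)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # This comparison is False for NaN and infinities, so they are rejected too.
    if not -limit <= number <= limit:
        return None
    return f"{number:.4f}"
```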
ENA has collector name and collecting institution fields. In my suggestion above I map these to our author and author affiliation fields. Is this OK? Should we add these fields to our metadata?