nextstrain / measles

Nextstrain build for measles virus
https://nextstrain.org/measles

Create custom code to parse Measles strain names #14

Open kimandrews opened 5 months ago

kimandrews commented 5 months ago

As discussed, the WHO requires measles strain names to include the sampling date and geographic location. In some cases, the strain name could therefore be used to recover the date and/or geographic location for samples whose values for these attributes are empty or ambiguous in the NCBI Datasets output. However, WHO-formatted strain names do not always appear in that output: some GenBank submitters report strain names in the "isolate" field and others in the "strain" field, but NCBI Datasets currently only pulls the "isolate" field. The NCBI Datasets team plans to add the "strain" field sometime this year. Once that is done, custom code could parse the NCBI Datasets output to do the following for each sample:

  1. Determine whether the WHO-formatted strain name is in the "isolate" or the "strain" field
  2. Parse the date and geographic location from the WHO-formatted strain name when these attributes are otherwise empty or ambiguous
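
To make the two steps concrete, here is a rough Python sketch, not a proposed implementation. It assumes WHO-formatted names look like `MVs/City.ISO3/week.year/` (for example `MVs/Manchester.GBR/7.12/ [B3]`, an illustrative name), that each sample arrives as a dict, and that the keys `isolate`, `strain`, `date` and `country` are available; the helper names and the output keys `who_name` and `who_name_field` are hypothetical. Because the name only carries an epidemiological week and a two-digit year, the sketch fills in the year alone, written as an ambiguous date, and guesses the century.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumed WHO-style layout: MVs/City.ISO3/week.year[/number] [genotype]
WHO_NAME = re.compile(
    r"^MV[is]/(?P<city>[^./]+)\.(?P<country>[A-Z]{3})"
    r"/(?P<week>\d{1,2})\.(?P<year>\d{2})(?:/\d*)?"
)


def find_who_name(record: dict) -> Optional[Tuple[str, str]]:
    """Step 1: return (field, value) for the field holding a WHO-style name."""
    for field in ("isolate", "strain"):
        value = (record.get(field) or "").strip()
        if WHO_NAME.match(value):
            return field, value
    return None


def expand_year(two_digit: str) -> int:
    """Guess the century (assumption): years after the current one are 19xx."""
    yy = int(two_digit)
    current_yy = date.today().year % 100
    return (2000 if yy <= current_yy else 1900) + yy


def fill_from_who_name(record: dict) -> dict:
    """Step 2: fill empty date/country from the strain name; keep existing values."""
    filled = dict(record)
    found = find_who_name(record)
    if found is None:
        return filled
    field, name = found
    match = WHO_NAME.match(name)
    filled["who_name"] = name
    filled["who_name_field"] = field
    if not filled.get("country"):
        filled["country"] = match["country"]  # three-letter country code
    if not filled.get("date"):
        # Only the year is recovered here; month and day stay ambiguous.
        filled["date"] = f"{expand_year(match['year'])}-XX-XX"
    return filled
```

This version only treats empty values as missing. Detecting ambiguous values, converting the week to a month, and mapping the three-letter code to a country name would all need rules agreed on for the workflow.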

This custom code may have minimal impact on the current measles workflow outputs, because very few samples that meet the minimum length requirement (5000 bp) have missing dates that could be recovered by this approach. However, if we eventually create gene-specific phylogenies, more samples would be affected. In addition, this code would recover WHO-formatted strain names for many samples (because many samples have strain names in the "strain" field), and there is value in having these strain names present in the metadata retrieved for all samples.