theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Update Gambit database files to version 1.3.0 #292

Closed kevinlibuit closed 9 months ago

kevinlibuit commented 9 months ago

Closes #277

:hammer_and_wrench: Changes Being Made

Updating the default GAMBIT reference files (signatures and metadata) to v1.3.0, sourced from a Theiagen-hosted, requester-pay bucket.

Impacted Workflows/Tasks

:brain: Context and Rationale

Sourcing from a requester-pay, us-central1 bucket will help to reduce egress fees currently incurred from the gs://theiagen-public-files/terra/gambit_files/1.1.0/* files. Also, utilizing the updated database files will help refine taxon mis-calls noted in previous GAMBIT database versions.

:clipboard: Workflow/Task Steps

The gambit task requires two reference files as input: gambit_db_genomes and gambit_db_signatures. This PR changes the default inputs to use the v1.3.0 GAMBIT files hosted in a requester-pay, us-central1 GCP bucket rather than the current v1.1.0 files hosted in an open, multi-region GCP bucket.

Inputs

The mandatory inputs to the task are gambit_db_genomes and gambit_db_signatures. Before this PR, their defaults were "gs://theiagen-public-files/terra/gambit_files/1.1.0/gambit-metadata-1.1-230417.gdb" and "gs://theiagen-public-files/terra/gambit_files/1.1.0/gambit-signatures-1.1-230417.gs", respectively; this PR points them at the corresponding v1.3.0 files.

Outputs

The outputs for this task will remain the same:

  output {
    File gambit_report_file = report_path
    File gambit_closest_genomes_file = closest_genomes_path
    String gambit_predicted_taxon = read_string("PREDICTED_TAXON")
    String gambit_predicted_taxon_rank = read_string("PREDICTED_TAXON_RANK") 
    String gambit_next_taxon = read_string("NEXT_TAXON")
    String gambit_next_taxon_rank = read_string("NEXT_TAXON_RANK")
    String gambit_version = read_string("GAMBIT_VERSION")
    String gambit_db_version = read_string("GAMBIT_DB_VERSION")
    String merlin_tag = read_string("MERLIN_TAG")
    String gambit_docker = docker
  }

Impacted Outputs

By default, the gambit_db_version output will change. In some instances, the GAMBIT taxon predictions may also differ slightly because of the updates made between GAMBIT database v1.1.0 and v1.3.0.

:test_tube: Testing

Terra

Scenarios for Reviewer to Test

:microscope: Quality checks

Pull Request (PR) checklist:

kevinlibuit commented 9 months ago

Tested on Terra with TheiaProk_FASTA_PHB.

cimendes commented 9 months ago

@kevinlibuit is this PR ready for review? :)

kevinlibuit commented 9 months ago

Yes, please 🙂

michellescribner commented 9 months ago

Apologies for the delay! I just launched GAMBIT_Query_PHB test: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/1eb6b443-caad-48c9-9ac8-671c1890a65d

I plan to compare to results from PHB v1.2.1 to verify that only expected taxa receive different results: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/69d5ad1f-2d47-407e-85c1-d3fd580c7734
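A minimal sketch of that comparison, assuming the results of each run are exported as a tab-separated table. The `sample_id` column name, the helper names, and the rows below are made-up placeholders for illustration; only `gambit_predicted_taxon` comes from the task outputs.

```python
import csv
import io


def load_predictions(tsv_text, id_column="sample_id",
                     taxon_column="gambit_predicted_taxon"):
    """Map each sample ID to its predicted taxon from a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column]: row[taxon_column] for row in reader}


def diff_predictions(old, new):
    """Return {sample: (old_taxon, new_taxon)} for samples in both runs whose taxon changed."""
    return {
        sample: (old[sample], new[sample])
        for sample in sorted(old.keys() & new.keys())
        if old[sample] != new[sample]
    }


# Placeholder rows, not real results from either run.
OLD_TSV = (
    "sample_id\tgambit_predicted_taxon\n"
    "sample_A\tGenus_a species_x\n"
    "sample_B\tGenus_b species_y\n"
)
NEW_TSV = (
    "sample_id\tgambit_predicted_taxon\n"
    "sample_A\tGenus_a species_x\n"
    "sample_B\tGenus_b\n"
)

changed = diff_predictions(load_predictions(OLD_TSV), load_predictions(NEW_TSV))
print(changed)
```

Only samples present in both tables are compared, so a sample that failed in one run would need to be checked separately.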

@cimendes Feel free to test as well!

michellescribner commented 9 months ago

gambit_db_update_exact_differences_phb_1_3.xlsx

All of the genomes that were assigned differently in GAMBIT_Query_PHB between PHB v1.2.1 and this dev branch are expected.

Differences in taxonomic assignment came from either (1) species newly introduced by the automated addition of genomes to the GAMBIT database from GTDB, or (2) one of the genera that were manually updated in the GAMBIT database.

However, the manual curation sometimes resulted in a sample losing some specificity in its assignment or changing to a taxonomic assignment that no longer matches the expected species from ATCC. All of the GAMBIT species names come from the names assigned by RefSeq and have been curated to remove outliers and species overlaps, so these cases probably just reflect species name disagreements between ATCC and RefSeq that have been introduced in RefSeq in the years since the first GAMBIT database was created. Alternatively, curation of RefSeq genomes in recent years may have resulted in a genus-level assignment where there was previously a species-level assignment.
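One way to separate these cases when reviewing the differences is to compare the predicted rank alongside the predicted taxon. The sketch below is only illustrative: it assumes gambit_predicted_taxon_rank takes the values "genus" or "species", and the function name and category labels are hypothetical.

```python
# Assumed rank labels, ordered from least to most specific.
RANK_ORDER = {"genus": 0, "species": 1}


def classify_change(old_taxon, old_rank, new_taxon, new_rank):
    """Label how a sample's assignment changed between two database versions."""
    if (old_taxon, old_rank) == (new_taxon, new_rank):
        return "unchanged"
    old_level = RANK_ORDER.get(old_rank)
    new_level = RANK_ORDER.get(new_rank)
    if old_level is None or new_level is None:
        # Covers missing predictions or rank labels not listed above.
        return "unknown_rank"
    if new_level < old_level:
        return "lost_specificity"
    if new_level > old_level:
        return "gained_specificity"
    return "renamed_same_rank"


# Placeholder values, not real results.
print(classify_change("Genus_b species_y", "species", "Genus_b", "genus"))
print(classify_change("Genus_c species_z", "species", "Genus_c species_w", "species"))
```

Samples labeled "lost_specificity" or "renamed_same_rank" are the ones worth checking by hand against the expected ATCC species.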

In summary, the changes are intended and I will approve this PR after acceptance of this explanation from others!