Closed kevinlibuit closed 9 months ago
Tested on Terra with TheiaProk_FASTA_PHB.
@kevinlibuit is this PR ready for review? :)
Yes, please 🙂
Apologies for the delay! I just launched a GAMBIT_Query_PHB test: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/1eb6b443-caad-48c9-9ac8-671c1890a65d
I plan to compare to results from PHB v1.2.1 to verify that only expected taxa receive different results: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/69d5ad1f-2d47-407e-85c1-d3fd580c7734
@cimendes Feel free to test as well!
gambit_db_update_exact_differences_phb_1_3.xlsx
All of the genomes that were assigned differently in GAMBIT_Query_PHB between PHB v1.2.1 and this dev branch are expected.
Differences in taxonomic assignment came from either (1) species newly introduced by the automated addition of genomes to the GAMBIT database from GTDB, or (2) one of the genera that were manually updated in the GAMBIT database.
However, the manual curation sometimes resulted in a sample losing some specificity in its assignment or changing to a taxonomic assignment that no longer matches the expected species from ATCC. All of the GAMBIT species names come from the name assigned by RefSeq and have been curated to remove outliers and species overlaps, so these cases probably just reflect species name disagreements between ATCC and RefSeq that have been introduced in RefSeq in the years since the first GAMBIT database was created. Alternatively, genomes curated from RefSeq in recent years may now result in a genus assignment where there was previously a species assignment.
In summary, the changes are intended and I will approve this PR after acceptance of this explanation from others!
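The comparison described in this comment can be sketched roughly as follows. This is only an illustration of how per-sample taxon calls from two runs might be diffed and labelled, not the method actually used to build the spreadsheet. The function names, the category labels, and the sample IDs and taxa in the usage note are all hypothetical, and the sketch assumes each call is a plain "Genus" or "Genus species" string.

```python
from typing import Dict, List, NamedTuple


class Change(NamedTuple):
    sample: str
    old: str
    new: str
    kind: str


def classify_change(old: str, new: str) -> str:
    """Label how a taxon call changed between two database versions.

    Assumes calls are plain "Genus" or "Genus species" strings
    (an assumption made for this sketch).
    """
    old_parts, new_parts = old.split(), new.split()
    if old_parts == new_parts:
        return "unchanged"
    if not old_parts or not new_parts:
        # One of the two runs produced no call at all.
        return "call_gained_or_lost"
    same_genus = old_parts[0] == new_parts[0]
    if same_genus and len(old_parts) > 1 and len(new_parts) == 1:
        # Species-level call replaced by a genus-level call.
        return "lost_specificity"
    if same_genus and len(old_parts) == 1 and len(new_parts) > 1:
        return "gained_specificity"
    if same_genus:
        return "species_changed"
    return "genus_changed"


def diff_calls(old_calls: Dict[str, str], new_calls: Dict[str, str]) -> List[Change]:
    """Return only the samples whose call differs, sorted by sample ID."""
    changes = []
    for sample in sorted(set(old_calls) & set(new_calls)):
        kind = classify_change(old_calls[sample], new_calls[sample])
        if kind != "unchanged":
            changes.append(Change(sample, old_calls[sample], new_calls[sample], kind))
    return changes
```

For example, with invented inputs, `diff_calls({"s1": "Escherichia coli"}, {"s1": "Escherichia"})` would report `s1` as `lost_specificity`, which is the kind of case discussed above where a species-level assignment becomes a genus-level one.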
Closes #277
:hammer_and_wrench: Changes Being Made
Updating the default GAMBIT reference files (signatures and metadata) to v1.3.0, using a Theiagen-hosted, requester-pay bucket.
Impacted Workflows/Tasks
:brain: Context and Rationale
Sourcing from a requester-pay, us-central1 bucket will help to reduce the egress fees currently incurred from the gs://theiagen-public-files/terra/gambit_files/1.1.0/* files. Also, using the updated database files will help correct taxon mis-calls noted in previous GAMBIT database versions.
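As background on what a requester-pay bucket means for anyone copying these files by hand: requests generally have to name a billing project, which `gsutil` accepts through its top-level `-u` option. The sketch below only builds such a command and does not run it. The helper name, the bucket path, and the project ID are hypothetical placeholders, since the v1.3.0 file locations are not given here.

```python
from typing import List


def gsutil_cp_command(source_uri: str, dest: str, billing_project: str) -> List[str]:
    """Build a gsutil copy command that bills `billing_project` for the request."""
    if not source_uri.startswith("gs://"):
        raise ValueError("source_uri must be a gs:// URI")
    if not billing_project:
        raise ValueError("a billing project is required for requester-pay buckets")
    # -u is gsutil's top-level option naming the project to bill.
    return ["gsutil", "-u", billing_project, "cp", source_uri, dest]


# Hypothetical placeholders; substitute the real file URI and your own project.
cmd = gsutil_cp_command(
    "gs://example-requester-pay-bucket/gambit/example.gdb",
    "./example.gdb",
    "my-billing-project",
)
# The command could then be run with subprocess.run(cmd, check=True).
```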
:clipboard: Workflow/Task Steps
For the gambit task to function, it requires two reference files as input: gambit_db_genomes and gambit_db_signatures. This PR modifies the default inputs to utilize v1.3.0 gambit files that are hosted in a requester-pay, us-central1 GCP bucket rather than the current v1.1.0 files hosted in an open, multi-region GCP bucket.
Inputs
The mandatory inputs to the task are gambit_db_genomes and gambit_db_signatures, which have been set to "gs://theiagen-public-files/terra/gambit_files/1.1.0/gambit-metadata-1.1-230417.gdb" and "gs://theiagen-public-files/terra/gambit_files/1.1.0/gambit-signatures-1.1-230417.gs", respectively.
Outputs
The outputs for this task will remain the same:
Impacted Outputs
By default, the gambit_db_version will be impacted. In some instances, the GAMBIT taxon predictions may also differ slightly due to the updates made from GAMBIT db v1.1.0 to v1.3.0.
:test_tube: Testing
Terra
Scenarios for Reviewer to Test
:microscope: Quality checks
Pull Request (PR) checklist: