theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[Internal - Gambitcore] Downgrade database to stable 1.3.0 version #473

Closed: cimendes closed this pull request 1 week ago

cimendes commented 1 month ago

This PR closes #474.

🗑️ This dev branch should NOT be deleted after merging to main.

:brain: Aim, Context and Functionality

This quick PR changes the default GAMBIT database used by gambitcore from v2.0.0 to the stable v1.3.0.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: Yes (internal workflow)

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: N/A

Databases or database versions changed: Gambit db v2.0.0 -> Gambit db v1.3.0

Data processing/commands changed: N/A

File processing changed: N/A

Compute resources changed: N/A

➡️ Inputs

N/A

⬅️ Outputs

N/A

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

The databases have been previously tested on Terra in this run.

Suggested Scenarios for Reviewer to Test

N/A

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

michellescribner commented 1 month ago

Noting here for documentation: I tested this PR on sample sets from the original GAMBIT publication and noticed that GAMBIT core reports the closest species for a sample even when GAMBIT proper assigns the sample only to the genus level. This appears to follow from how the GAMBIT core tool works: it simply takes the species of the closest genome. Instead, we would like GAMBIT core to report a warning message when GAMBIT cannot predict the taxon to the species level.
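The requested behavior can be sketched roughly as follows. This is a minimal Python illustration, not the gambitcore implementation: the names `CoreReport` and `report_species`, the `predicted_rank` argument, and the wording of the warning are all hypothetical and chosen only to show the intended logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoreReport:
    """Hypothetical result record; not part of gambitcore."""
    species: str            # species name to report, or "NA"
    warning: Optional[str]  # human-readable warning, or None


def report_species(predicted_rank: str, closest_species: str) -> CoreReport:
    """Report the closest genome's species only if the prediction reached species rank.

    Assumption: the caller already knows the rank of the GAMBIT prediction
    (for example "species" or "genus") and the species of the closest genome.
    """
    if predicted_rank.strip().lower() == "species":
        return CoreReport(species=closest_species, warning=None)
    # Prediction stopped above species level: warn instead of reporting
    # the closest genome's species as if it were a confident call.
    return CoreReport(
        species="NA",
        warning=f"GAMBIT prediction stopped at rank '{predicted_rank}'; species not reported",
    )
```

With this sketch, `report_species("genus", "Escherichia coli")` yields a record whose species is "NA" and whose warning names the genus rank, while a species-level prediction passes through unchanged.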

cimendes commented 1 month ago

The suggestions are being addressed in a separate pull request at https://github.com/gambit-suite/gambitcore/pull/1. They will require a new Docker container in this PR to take effect. I will update this PR as soon as that pull request is merged.

cimendes commented 1 month ago

Converting to draft until a new gambitcore container with the proposed changes is available.

cimendes commented 2 weeks ago

@michellescribner your suggestion has been integrated into the latest container for gambitcore! I've set the PR back to ready for review! :D

michellescribner commented 1 week ago

@cimendes Tested on 88 samples from GAMBIT paper set 1. Samples that could not be identified to the species level failed analysis because "NA" was provided to integer outputs, including gambitcore_species_kmers: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/f879d219-fcaa-4b66-a70f-2c0ad6a94f08
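The failure mode above, a literal "NA" arriving where an integer is expected, can be illustrated with a small Python analogue. This is only a sketch of one way to tolerate the placeholder; it is not the fix applied in this PR, and the helper name `parse_kmer_count` is hypothetical.

```python
from typing import Optional


def parse_kmer_count(raw: str) -> Optional[int]:
    """Return the count as an int, or None when the value is the "NA" placeholder.

    Assumption: the tool writes either a base-10 integer or the literal
    string "NA" when no species-level result is available.
    """
    value = raw.strip()
    if value == "NA":
        return None
    return int(value)


def strict_parse_fails(raw: str) -> bool:
    """Show that a strict integer conversion rejects the value."""
    try:
        int(raw.strip())
    except ValueError:
        return True
    return False
```

Here `strict_parse_fails("NA")` is true, mirroring the typed-output failure, while `parse_kmer_count("NA")` returns `None` so downstream code can treat the value as missing.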

michellescribner commented 1 week ago

Tested successfully on previous dataset! https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/09ee42c4-135f-4d5c-b889-1b122ebc535c

Samples that could not be identified to the species level were confirmed to result in "NA" outputs, as expected.