phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0

Output "NA" as subtype for samples that fail QC with no subtype result or no targets found #112

Closed glabbe closed 4 years ago

glabbe commented 4 years ago

@dankein noticed that it is not possible to link metadata to the results using the biohansel metadata option if the "subtype" field is empty, as is the case for a QC FAIL due to "NO TARGETS FOUND" or "NO SUBTYPE RESULT". A possible solution would be to output "NA" in the subtype column in these cases, so that metadata keyed on "NA" can be returned with the results.

peterk87 commented 4 years ago

Hi @glabbe, if it's desirable to have subtype metadata attached to a null result, you could add a row to the metadata table with an empty cell under the subtype field and whatever metadata values are appropriate in the other fields, e.g.

| subtype | subtype_metadata  |
|---------|-------------------|
| 1.1     | metadata for 1.1  |
|         | metadata for null |

Could you or @dankein give an example of subtype result metadata that would be returned with a null subtype result?

dankein commented 4 years ago

Hi @peterk87, I tried doing what you suggested with the metadata file before mentioning this issue to @glabbe, but it doesn't appear to work in either the command-line or the Galaxy version.

We're using the metadata to reformat the "tech results" into a format that can be pasted line by line into our LIMS for reporting. This includes version numbers of the scheme, metadata, and Galaxy tool, as well as custom comments for reports and/or instructions not to report certain species without repeats... that sort of thing. Below is a partial example of what we're using.

In the case of "No subtype result" we still would like to attach the version metadata and include an instruction to not report the test.

| subtype | Species | differentiation status | scheme version | metadata version | galaxy tool version | comments |
|---------|---------|------------------------|----------------|------------------|---------------------|----------|
| 1 | M. tuberculosis | differentiation complete | 2.1 | 2.3 | 2.2 | |
| 1.1 | M. bovis / M. bovis BCG | partial differentiation | 2.1 | 2.3 | 2.2 | identification incomplete - repeat sequencing |
| 1.1.1 | M. bovis BCG | differentiation complete | 2.1 | 2.3 | 2.2 | M. bovis BCG is a vaccine strain. |
| NA | no subtype | differentiation Failed | 2.1 | 2.3 | 2.2 | Do not report - no subtype found |

Thanks for your help! Dan
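To make the request above concrete, here is a minimal pandas sketch of attaching metadata to a null result through a placeholder subtype. It is not biohansel's code: the table contents, the column names other than `subtype` and `comments`, and the choice of a left merge are assumptions for illustration, and the thread does not show how biohansel itself reads the metadata table. One detail worth knowing when reproducing this with pandas: by default `read_csv` parses the strings "NA" and "#N/A" (and empty cells) as missing values, so the sketch passes `keep_default_na=False` to keep the placeholder as a literal string.

```python
import io

import pandas as pd

# Hypothetical results and metadata tables (tab-separated), for illustration only.
results_tsv = (
    "sample\tsubtype\tqc_status\n"
    "S1\t1.1\tPASS\n"
    "S2\tNA\tFAIL\n"
)
metadata_tsv = (
    "subtype\tcomments\n"
    "1.1\tidentification incomplete - repeat sequencing\n"
    "NA\tDo not report - no subtype found\n"
)

# dtype=str keeps "1.1" as text; keep_default_na=False keeps "NA" as a
# literal string instead of converting it to a missing value.
results = pd.read_csv(
    io.StringIO(results_tsv), sep="\t", dtype=str, keep_default_na=False
)
metadata = pd.read_csv(
    io.StringIO(metadata_tsv), sep="\t", dtype=str, keep_default_na=False
)

# Left merge: every result row is kept, and metadata is attached wherever
# the subtype string matches a row in the metadata table.
merged = results.merge(metadata, on="subtype", how="left")
print(merged)
```

With a literal placeholder in both tables, the failed sample picks up the "Do not report" comment in the same way a typed sample picks up its own metadata.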

peterk87 commented 4 years ago

FYI the development branch version of biohansel (v2.3.0) outputs #N/A for null subtype results (added in PR #81). I'm not sure if this is the version in Galaxy at the moment.

glabbe commented 4 years ago

Thanks for the heads up @peterk87. The Galaxy version of biohansel is still v2.2.0, so it will need to be updated.

glabbe commented 4 years ago

Darian has started a pull request (#152) to update biohansel in Galaxy: https://github.com/phac-nml/galaxy_tools/pull/152#issuecomment-526702246

glabbe commented 4 years ago

@dankein I will talk with @Takadonet in the coming days about how to update the biohansel version in Galaxy to fix this issue

glabbe commented 4 years ago

@peterk87 Actually, I just found that the fix implemented in PR #81 only outputs '#N/A' if no k-mer match is found. If only negative k-mers are found, and therefore no subtype is found, the subtype field is still left blank. See the attached output file, which I got when using a truncated MTB sequence: MTB_truncated_test.txt. The truncated sequence used with the tb_lineage scheme is also attached (extension changed to .txt because .fasta is not supported on GitHub): truncated_H37Rv_reference.txt

DarianHole commented 4 years ago

I used dfsummary['subtype'].fillna(value='#N/A', inplace=True) to put #N/A in the subtype column when no subtype was found. This runs right after the dataframe is created and, from what I understand, should fill the cell if it is blank.

So, I'm going to hazard a guess that when a kmer is found but no subtype is given, the column is filled with a ' ' or something similar that prevents fillna from working in those cases. But I'm not 100% sure on that, just a guess that I must have missed when testing the earlier change (#81).
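A guess like this can be checked directly by looking at what the "blank" cells actually contain. The sketch below uses a made-up summary table (the sample names and values are assumptions, not biohansel output) to separate true missing values from strings that only look empty.

```python
import numpy as np
import pandas as pd

# Hypothetical summary table: a real subtype, a missing value,
# an empty string, and a single space.
dfsummary = pd.DataFrame({
    "sample": ["a", "b", "c", "d"],
    "subtype": ["1.1", np.nan, "", " "],
})

# isna() flags only the true missing value; this is what fillna() acts on.
is_missing = dfsummary["subtype"].isna()

# Strings that are empty once surrounding whitespace is stripped.
# Missing values pass through .str.strip() and compare as False here.
is_blank_string = dfsummary["subtype"].str.strip().eq("")

# repr() makes the difference between nan, '' and ' ' visible.
print(dfsummary["subtype"].map(repr).tolist())
print(is_missing.tolist())
print(is_blank_string.tolist())
```

Cells flagged by `is_blank_string` but not by `is_missing` are the ones that a `fillna` call alone would leave untouched.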

glabbe commented 4 years ago

@peterk87 @schonfju Justin found a fix: pandas treats an empty string differently from a missing value. @DarianHole Darian's fix handles the case where the subtype is a missing value. We also need to handle the case where it is an empty string. I will open a pull request that adds the second line below under Darian's line in Main.py:

dfsummary['subtype'].fillna(value='#N/A', inplace=True)
dfsummary['subtype'].replace('','#N/A', inplace=True)
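The difference between the two cases can be reproduced in a few lines. This is a sketch on a made-up dataframe (sample names and values are assumptions), not the code from Main.py. It also uses the assignment form rather than `inplace=True` on a selected column, a deliberate substitution because recent pandas releases warn about that pattern.

```python
import numpy as np
import pandas as pd

NULL_SUBTYPE = "#N/A"

# Hypothetical summary: one typed sample, one missing value (NaN),
# and one empty string.
dfsummary = pd.DataFrame({
    "sample": ["matched", "no_kmers", "negative_only"],
    "subtype": ["1.1", np.nan, ""],
})

# fillna() only touches true missing values; the empty string survives.
only_fillna = dfsummary["subtype"].fillna(NULL_SUBTYPE)

# Adding replace() for the exact empty string covers the second case.
dfsummary["subtype"] = (
    dfsummary["subtype"].fillna(NULL_SUBTYPE).replace("", NULL_SUBTYPE)
)

print(only_fillna.tolist())
print(dfsummary["subtype"].tolist())
```

Note that `replace("", ...)` matches only the exact empty string, so a cell holding whitespace such as `' '` would need separate handling.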

glabbe commented 4 years ago

You were right @DarianHole, the field changes after merging the results with metadata (which happens for the tb_lineage and Typhi schemes by default). If there are no kmer matches found, there is a bypass in subtyper.py, so the results are not merged with the metadata and the dataframe ends up being different than when kmers are found.

DarianHole commented 4 years ago

Merged in #120. The fixes worked for all the cases I could think of and that were tested. If something else comes up, reopen this issue or create a new one.