Fixes and QoL Additions

phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes

Apache License 2.0


Fixes and QoL Additions #30

Closed mgopez closed 6 years ago

mgopez commented 6 years ago

1) Adding fastqsanger to the list of fastq regex patterns.
This is needed because there were cases where a fastqsanger file wasn't being recognized as a fastq file, which led to kmer_cov_freq not being considered.

2) Adding an average coverage value column to the subtyping result.
Requested so that a user can check coverage in the results.

3) Fixing the order of parameters for the blast method.
The incorrect order of parameters was causing the blast method to fail.

Further questions:

Do you think it's a good idea to have coverage present in the subtyping result? Or would you rather this be a new column in the DataFrame for match_results?

peterk87 commented 6 years ago

We could peek at the first X bytes of the file if the file extension is unknown (e.g. if the first non-whitespace character is > then it's fasta, if it is @ then it's fastq, and if the filename matches .*\.gz$ then read the first X uncompressed bytes of the file to determine whether it's fasta or fastq). Maybe instead of a regex we could just have a dict mapping known file extensions to their respective parsing functions, or an enum.
In addition to mean coverage, we could also report the median coverage from the frequency of all kmers. I'm wondering if it would be worthwhile to report separate coverage values for when you observe mixed subtyping results. For now it might be best to just show one for the negative tiles.
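
The format-detection idea in the first point above could look roughly like the following. This is only a sketch, not biohansel's code: `SeqFormat`, `KNOWN_EXTENSIONS`, and `sniff_format` are hypothetical names, and the extension table is an assumed example list. It checks a dict of known extensions first and only peeks at the first bytes (decompressing when the name ends in .gz) if the extension is unknown.

```python
import gzip
import os
from enum import Enum
from typing import Optional


class SeqFormat(Enum):
    FASTA = "fasta"
    FASTQ = "fastq"


# Hypothetical extension table (an assumption, not biohansel's actual list).
KNOWN_EXTENSIONS = {
    ".fasta": SeqFormat.FASTA,
    ".fa": SeqFormat.FASTA,
    ".fna": SeqFormat.FASTA,
    ".fastq": SeqFormat.FASTQ,
    ".fq": SeqFormat.FASTQ,
    ".fastqsanger": SeqFormat.FASTQ,
}


def sniff_format(path: str, n_bytes: int = 1024) -> Optional[SeqFormat]:
    """Guess fasta/fastq from the extension, else from the first bytes."""
    name = path.lower()
    is_gzipped = name.endswith(".gz")
    if is_gzipped:
        name = name[:-3]  # look at the extension underneath ".gz"
    ext = os.path.splitext(name)[1]
    if ext in KNOWN_EXTENSIONS:
        return KNOWN_EXTENSIONS[ext]

    # Unknown extension: read the first n_bytes of (uncompressed) content.
    opener = gzip.open if is_gzipped else open
    with opener(path, "rb") as handle:
        head = handle.read(n_bytes).lstrip()
    if head.startswith(b">"):
        return SeqFormat.FASTA
    if head.startswith(b"@"):
        return SeqFormat.FASTQ
    return None  # neither fasta nor fastq could be determined
```

With a table like this, supporting a new extension such as fastqsanger is a one-line change rather than a regex edit, and files with an uninformative name still get classified by content.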

I think most people would want to see coverage in their results, but definitely in a separate column. I think it might be a good idea to add it to all reports, even the tech report, since it is valuable QA/QC info on its own.
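
As a rough illustration of the mean and median coverage discussed above, here is a minimal sketch. It assumes the kmer frequencies are available as a plain mapping from tile name to observed frequency; `coverage_summary` and the column names `avg_tile_coverage` and `median_tile_coverage` are hypothetical and do not come from biohansel.

```python
from statistics import mean, median
from typing import Dict, Iterable, Optional


def coverage_summary(
    tile_freqs: Dict[str, int],
    tiles: Optional[Iterable[str]] = None,
) -> Dict[str, float]:
    """Mean and median kmer frequency over all tiles, or over a subset."""
    if tiles is None:
        freqs = list(tile_freqs.values())
    else:
        # Restrict to a subset, e.g. only the negative tiles.
        freqs = [tile_freqs[t] for t in tiles if t in tile_freqs]
    if not freqs:
        return {"avg_tile_coverage": 0.0, "median_tile_coverage": 0.0}
    return {
        "avg_tile_coverage": float(mean(freqs)),
        "median_tile_coverage": float(median(freqs)),
    }
```

The returned dict could be merged into each report row as separate columns. Passing a subset through `tiles` would cover the case of reporting coverage for the negative tiles only.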

mgopez commented 6 years ago

@peterk87 I also added a new flag which adds the ability to output JSON representations of the results files. This is needed for when we populate the tech results table on IRIDA.
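
For illustration only, a JSON representation of a tab-delimited results file can be produced with the standard library along these lines. `tab_to_json` is a hypothetical helper, not the flag's actual implementation, and the column names in the usage example are made up.

```python
import csv
import io
import json


def tab_to_json(tab_text: str) -> str:
    """Convert tab-delimited text with a header row into a JSON array of records."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    # Every value stays a string; numeric columns would need explicit conversion.
    return json.dumps(list(reader), indent=2)


# Usage with made-up columns:
print(tab_to_json("sample\tsubtype\nA\t1.2\n"))
```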

mgopez commented 6 years ago

@peterk87 When you say take the average of the negative tiles, why is this the case?

Does the coverage of the negative tiles show how plausible the results are? In other words, the higher the coverage of the negative tiles, the more certainty you have in the positive hits?

If this is the case, should we just show the median frequency of the negative tiles in the coverage column?

mgopez commented 6 years ago

I spoke with Marisa and Genevieve, and they would like coverage for all tiles to be present in

results.tab
tech_results.tab

Then they would like to see a warning when overall tile coverage is below a given value (default 20).
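
The requested warning could be sketched as below. The default of 20 is taken from the comment above. The function name, parameter names, and message wording are assumptions for illustration.

```python
from typing import Optional

DEFAULT_MIN_COVERAGE = 20  # default threshold mentioned in the comment above


def coverage_warning(
    avg_coverage: float,
    min_coverage: float = DEFAULT_MIN_COVERAGE,
) -> Optional[str]:
    """Return a warning message if coverage is below the threshold, else None."""
    if avg_coverage < min_coverage:
        return (
            f"Low coverage warning: average tile coverage {avg_coverage:.1f} "
            f"is below the minimum of {min_coverage}"
        )
    return None
```

Returning the message instead of printing it lets the caller decide whether to log it or place it in a column of results.tab and tech_results.tab.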