phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0

Fixes and QoL Additions #30

Closed mgopez closed 6 years ago

mgopez commented 6 years ago

1) Adding fastqsanger to the list of fastq regex patterns.

Further questions:

peterk87 commented 6 years ago
  1. We could peek at the first X bytes of the file if the file extension is unknown: if the first non-whitespace character is >, it's FASTA; if it is @, it's FASTQ; and if the filename matches .*\.gz$, read the first X uncompressed bytes of the file to decide between the two. Maybe instead of a regex we could just have a dict (or an enum) mapping known file extensions to their respective parsing functions.

  2. In addition to mean coverage, we could also report the median coverage from the frequency of all kmers. I'm wondering if it would be worthwhile to report separate coverage values for when you observe mixed subtyping results. For now it might be best to just show one for the negative tiles.

I think most people would want to see coverage in their results, but definitely in a separate column. It might be a good idea to add it to all reports, even the tech report, since it is valuable QA/QC info on its own.
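To make point 1 above concrete, here is a rough sketch of extension lookup with a content-sniffing fallback. It is only an illustration, not biohansel's actual code: the `SeqFormat` enum, the `KNOWN_EXTENSIONS` table, the `sniff_format` name, and the 1024-byte default standing in for "X" are all assumptions.

```python
import gzip
from enum import Enum
from pathlib import Path


class SeqFormat(Enum):
    FASTA = "fasta"
    FASTQ = "fastq"
    UNKNOWN = "unknown"


# Hypothetical extension table; not the list biohansel actually uses.
KNOWN_EXTENSIONS = {
    ".fasta": SeqFormat.FASTA,
    ".fa": SeqFormat.FASTA,
    ".fna": SeqFormat.FASTA,
    ".fastq": SeqFormat.FASTQ,
    ".fq": SeqFormat.FASTQ,
    ".fastqsanger": SeqFormat.FASTQ,
}


def sniff_format(path, n_bytes=1024):
    """Guess the sequence format from the extension, else from the first bytes."""
    p = Path(path)
    is_gz = p.suffix.lower() == ".gz"
    # For "reads.fastq.gz", look at the ".fastq" part rather than ".gz".
    inner_suffix = Path(p.stem).suffix.lower() if is_gz else p.suffix.lower()
    fmt = KNOWN_EXTENSIONS.get(inner_suffix)
    if fmt is not None:
        return fmt
    # Unknown extension: peek at the first n_bytes of (uncompressed) content.
    opener = gzip.open if is_gz else open
    with opener(p, "rb") as fh:
        head = fh.read(n_bytes).lstrip()
    if head.startswith(b">"):
        return SeqFormat.FASTA
    if head.startswith(b"@"):
        return SeqFormat.FASTQ
    return SeqFormat.UNKNOWN
```

With a table like this, supporting a new extension such as fastqsanger is a one-line change, and the sniffing step only runs when the extension is not recognized.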

mgopez commented 6 years ago

@peterk87 I also added a new flag which adds the ability to output JSON representations of the results files. This is needed for when we populate the tech results table on IRIDA.
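For illustration only, converting a tab-delimited results table into a JSON list of records could look roughly like the sketch below. The function name `tsv_to_json` and the column names in the usage line are hypothetical; the actual flag and output layout are whatever the pull request implements.

```python
import csv
import io
import json


def tsv_to_json(tsv_text):
    """Turn tab-delimited results text into a JSON array of row objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Each row becomes one JSON object keyed by the header names.
    return json.dumps(list(reader), indent=2)


# Hypothetical columns, just to show the shape of the output.
print(tsv_to_json("sample\tsubtype\nS1\t2.1.1\n"))
```

Note that every value comes out as a string in this sketch; numeric columns would need explicit conversion before serializing.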

mgopez commented 6 years ago

@peterk87 When you say take the average of the negative tiles, why is this the case?

If this is the case, should we just show the median frequency of the negative tiles in the coverage column?

mgopez commented 6 years ago

I spoke with Marisa and Genevieve, and they would like to see that coverage for all tiles is present in

They would then like to see a warning when overall tile coverage is below a given value (default: 20).
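A minimal sketch of how the coverage column and the warning could fit together, assuming coverage is summarized from per-tile k-mer frequencies. The function name, the dictionary keys, and the choice to compare the mean (rather than the median) against the threshold are assumptions for discussion, not what the code currently does.

```python
from statistics import mean, median

DEFAULT_MIN_COVERAGE = 20  # default value suggested in this thread


def coverage_summary(tile_freqs, min_coverage=DEFAULT_MIN_COVERAGE):
    """Summarize tile coverage and attach a warning when it is too low."""
    if not tile_freqs:
        return {
            "mean_coverage": 0.0,
            "median_coverage": 0.0,
            "warning": "No tiles observed; coverage cannot be estimated.",
        }
    avg = mean(tile_freqs)
    med = median(tile_freqs)
    warning = None
    # Assumption: the mean is the value checked against the threshold.
    if avg < min_coverage:
        warning = (
            f"Low coverage: mean tile coverage {avg:.1f} "
            f"is below the minimum of {min_coverage}."
        )
    return {"mean_coverage": avg, "median_coverage": med, "warning": warning}
```

Reporting both the mean and the median keeps the option open to switch the warning to the median later, since the median is less sensitive to a few very high-frequency tiles.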