v1.0.0a5: Various changes to support integration with Proksee Web

"Older" Changes

Fixes software dependency versions in environment.yml and reports those versions in the output.
Reports the correct number of contigs.
Always reports NCBI exclusion criteria.
Changes to agree with new flake8 requirements.

Recent Additions

Updated the refseq_short.csv database using the buildscript and updated that script to allow the changes shown below.
- Added the number of assemblies (instances) available for each species. This count is also reported in the JSON.
- Added the median (50th percentile) for each statistic. This is also reported in the JSON.
Updated the main control flow of evaluate and assemble to always run both the heuristic species-based evaluation and the NCBI RefSeq exclusion criteria evaluation.
- The JSON files have been updated to show this.
Added a flag in the JSON for whether or not a particular evaluation was performed.
Added an output to the JSON stating whether or not a species was found in the corresponding database.
Reworded the machine learning evaluation's "The probability of the assembly being a good assembly" to "The probability of the assembly being similar to a curated assembly of the same species"
Addressed the issue of the Mash command line sometimes being too long.
- Since every contig adds another argument to the command line, the solution was to stop adding new contigs when the line becomes too long and ignore the rest.
- I think this is okay because contamination estimation is already difficult and imprecise (we're only catching the most obvious stuff), the missed contigs will be the smallest ones, and this solution is fast and clean to implement.
Added a test to ensure consistency between version.py and environment.yml.
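
A consistency check of this kind might be sketched as follows. The layout of version.py is not shown here, so the sketch simply assumes the expected versions are available as a dictionary; the helper names and the line-based parsing of `name=version` pins are assumptions, not the actual test.

```python
import re


def parse_pinned_versions(environment_text):
    """Collect 'name=version' pins from the text of a conda environment file."""
    pins = {}
    for line in environment_text.splitlines():
        # Matches list entries such as "  - name=1.2.3" or "  - name==1.2.3".
        match = re.match(r"^\s*-\s*([A-Za-z0-9_.\-]+)={1,2}([^\s=#]+)", line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def find_version_mismatches(expected_versions, environment_text):
    """Return {name: (expected, pinned)} for every dependency that disagrees.

    expected_versions: dict of name -> version string (assumed to come from
    version.py in some form). A missing pin is reported as None.
    """
    pins = parse_pinned_versions(environment_text)
    return {
        name: (version, pins.get(name))
        for name, version in expected_versions.items()
        if pins.get(name) != version
    }
```

A test would then assert that `find_version_mismatches(...)` returns an empty dictionary.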
Minimum contig size can be specified (--min-contig-size).
- This value will be used for QUAST and will be reported in the JSON file.
- The default size is now 1000.
- Two output files are produced: unfiltered contigs and filtered contigs.
- There are two assembly lengths reported in the JSON file (unfiltered and filtered lengths).
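
As a rough sketch of the filtering behaviour described above: the function below keeps contigs at or above the minimum size and reports both total lengths. The function name, the inclusive `>=` comparison, and the summary key names are assumptions for illustration and may not match the actual JSON fields.

```python
def filter_contigs(contigs, min_contig_size=1000):
    """Split contigs by size and summarise both assembly lengths.

    contigs: dict mapping contig name -> sequence string.
    Returns (filtered_contigs, summary).
    """
    # Assumption: a contig exactly at the threshold is kept.
    filtered = {
        name: sequence
        for name, sequence in contigs.items()
        if len(sequence) >= min_contig_size
    }
    summary = {
        "minContigSize": min_contig_size,  # hypothetical key name
        "unfilteredLength": sum(len(s) for s in contigs.values()),
        "filteredLength": sum(len(s) for s in filtered.values()),
    }
    return filtered, summary
```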
Changed JSON output to camel case and made a few other tweaks for simplicity and clarity.
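
For readers consuming the JSON, the naming change amounts to something like the conversion below. This is only a sketch that assumes the previous keys were snake case; the helper names are hypothetical and the PR may have renamed the keys by hand.

```python
def to_camel_case(name):
    """Convert a snake_case name to camelCase, e.g. min_contig_size -> minContigSize."""
    head, *tail = name.split("_")
    # Upper-case only the first character so the rest of each part is preserved.
    return head + "".join(part[:1].upper() + part[1:] for part in tail)


def camelize_keys(value):
    """Recursively rename dictionary keys in a JSON-like structure."""
    if isinstance(value, dict):
        return {to_camel_case(key): camelize_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value
```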

Description of Evaluation Process

There are three evaluations performed by Proksee [command/assemble/evaluate]: a species-based heuristic evaluation, an NCBI RefSeq exclusion criteria-based heuristic evaluation, and a species-based machine learning evaluation.

The species-based heuristic evaluation works by comparing common assembly quality metrics (number of contigs, length, N50, and L50) against a database of curated assembly quality metrics derived from NCBI RefSeq assemblies. If the species is determined with confidence, the evaluation will check to see if each quality metric of the Proksee-pipeline generated assembly falls within an acceptable percentile range when compared to other curated assemblies of the same species.
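
The per-metric check can be pictured with the sketch below. The percentile bounds actually used are not stated here, so the acceptable range for each metric is passed in as a plain `(low, high)` pair; the function and argument names are hypothetical.

```python
def evaluate_against_percentiles(assembly_metrics, reference_ranges):
    """Check each metric against the acceptable range for its species.

    assembly_metrics: dict of metric name -> value for the new assembly.
    reference_ranges: dict of metric name -> (low, high) bounds, assumed to be
        percentile values taken from the database row for the species.
    Returns a dict of metric name -> True if the value is within range.
    """
    results = {}
    for metric, value in assembly_metrics.items():
        low, high = reference_ranges[metric]
        # Each metric is judged on its own, independently of the others.
        results[metric] = low <= value <= high
    return results
```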

The NCBI RefSeq exclusion criteria-based heuristic evaluation works by comparing common assembly quality metrics (number of contigs, length, N50, and L50) against RefSeq's exclusion criteria. That is, if the assembly's metrics don't meet the thresholds specified by RefSeq, the assembly will not be accepted into RefSeq.
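
In contrast to the species-specific ranges, this check uses fixed thresholds. A minimal sketch, assuming each threshold can be expressed as an optional minimum and maximum, is shown below. The actual RefSeq threshold values are deliberately not reproduced; the caller supplies them, and all names are hypothetical.

```python
def check_exclusion_criteria(metrics, thresholds):
    """Report which fixed thresholds an assembly fails.

    metrics: dict of metric name -> value.
    thresholds: dict of metric name -> (minimum or None, maximum or None),
        supplied by the caller; None means no bound on that side.
    Returns (passed, failures) where failures is a list of messages.
    """
    failures = []
    for metric, (minimum, maximum) in thresholds.items():
        value = metrics[metric]
        if minimum is not None and value < minimum:
            failures.append(f"{metric}={value} is below the minimum of {minimum}")
        elif maximum is not None and value > maximum:
            failures.append(f"{metric}={value} is above the maximum of {maximum}")
    return len(failures) == 0, failures
```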

The species-based machine learning evaluation works much like the species-based heuristic evaluation, except that the assembly quality metrics are considered together in a machine learning context rather than evaluated one at a time.

Won't Do

Adding contamination reports to the JSON file.
- The problem is that contamination estimation / detection is really challenging and only works some of the time. I don't feel comfortable having it prominent in the output right now.
If the user provides the species, do comparisons based on what they provided, but still do estimation and show the estimated species in the output / JSON output.
- I think we should move this to another release.

Example JSON Files

(Updated 2022-10-13): Please note that they're saved as txt in order to upload to GitHub, but they're all assembly_info.json files.

staph_aureus.txt ERR234657.txt campy.txt

proksee-project / proksee-cmd

v1.0.0a5: Various changes to support integration with Proksee Web #77