Validation data sets - Githubissues

np-core / nanopath

Python package and command line interface - entry point for the repository :snake:


Validation data sets #4

Open esteinig opened 4 years ago

esteinig commented 4 years ago

We should think about validation data sets. Maybe we can just compose them synthetically: we can get human Fast5 files from the reference genome consortium and inject reads from a range of pathogens at predetermined proportions. We might need to rebasecall everything to standardize.

For the moment, some MRSA and Klebsiella reads mixed into human reads will probably suffice.

esteinig commented 4 years ago

I will make a small script for this in the nanopath package, so we can sample from host and pathogen fastq files.
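A minimal sketch of what such a sampling script could look like, assuming plain four-line FASTQ records. The helper names `read_fastq`, `sample_reads` and `compose_mixture` are hypothetical and are not taken from the nanopath package. Reservoir sampling is used here so that large read files do not need to fit in memory, which is one possible choice rather than the package's actual method.

```python
import random
from itertools import islice
from typing import Dict, Iterator, List, Tuple

Read = Tuple[str, ...]  # header, sequence, separator, quality


def read_fastq(path: str) -> Iterator[Read]:
    """Yield FASTQ records, assuming exactly four lines per read."""
    with open(path) as handle:
        while True:
            record = tuple(line.rstrip("\n") for line in islice(handle, 4))
            if len(record) < 4:
                break
            yield record


def sample_reads(path: str, n: int, rng: random.Random) -> List[Read]:
    """Reservoir-sample up to n reads from a file in a single pass."""
    reservoir: List[Read] = []
    for i, record in enumerate(read_fastq(path)):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = record
    return reservoir


def compose_mixture(
    sources: Dict[str, float], total: int, seed: int = 0, shuffle: bool = True
) -> List[Read]:
    """Sample round(total * proportion) reads per source file (hypothetical helper)."""
    rng = random.Random(seed)
    mixture: List[Read] = []
    for path, proportion in sources.items():
        mixture.extend(sample_reads(path, round(total * proportion), rng))
    if shuffle:
        rng.shuffle(mixture)
    return mixture
```

If a source file holds fewer reads than requested, this sketch silently returns all of them, so a real script would probably want to warn about that.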

esteinig commented 4 years ago

This will also be useful to assess the human decontamination on the client side.
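Because the source of every read in a synthetic mixture is known, decontamination can be scored directly. The following is a rough sketch under the assumption that read IDs are tracked per source and that the client returns the set of read IDs it retained. The function name and the returned keys are made up for illustration.

```python
from typing import Dict, Set


def decontamination_metrics(
    human_ids: Set[str], pathogen_ids: Set[str], retained_ids: Set[str]
) -> Dict[str, float]:
    """Score host removal against known read origins (hypothetical helper)."""
    human_removed = len(human_ids - retained_ids)    # host reads correctly dropped
    pathogen_lost = len(pathogen_ids - retained_ids)  # pathogen reads wrongly dropped
    return {
        "human_removed_fraction": human_removed / len(human_ids) if human_ids else 0.0,
        "pathogen_lost_fraction": pathogen_lost / len(pathogen_ids) if pathogen_ids else 0.0,
    }
```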

esteinig commented 4 years ago

Added compose CLI:

Usage: nanopath utils compose [OPTIONS]

  Compose artificial mixtures by sampling from read files

Options:
  -c, --composition PATH      JSON file, composition configuration
  -o, --output PATH           Output reads file path
  -r, --reads INTEGER         Total reads to sample for the mixture
  -s, --shuffle               Shuffle output reads
  --help                      Show this message and exit.

Example JSON file at test/data/compose.json:

{
  "human": {
    "file": "/data/nanopath/test/human.fq",
    "proportion": 0.90
  },
  "saureus": {
    "file": "/data/nanopath/test/saureus.fq",
    "proportion": 0.05
  },
  "kpneumoniae": {
    "file": "/data/nanopath/test/kpneumoniae.fq",
    "proportion": 0.05
  }
}
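For illustration, a configuration like the one above could be turned into per-source read counts roughly as follows. This is a sketch and not the code behind `nanopath utils compose`: the function name `read_counts` is hypothetical, and the largest-remainder rounding is an assumption chosen so that the counts always add up to the requested total.

```python
import json
import math
from typing import Dict


def read_counts(config: Dict[str, dict], total_reads: int) -> Dict[str, int]:
    """Convert proportions into integer read counts that sum to total_reads."""
    proportions = {name: entry["proportion"] for name, entry in config.items()}
    if not math.isclose(sum(proportions.values()), 1.0, abs_tol=1e-9):
        raise ValueError("Proportions in the composition file must sum to 1")

    exact = {name: p * total_reads for name, p in proportions.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}

    # Hand the reads lost to flooring to the sources with the largest remainders.
    remainder = max(0, total_reads - sum(counts.values()))
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for name in by_fraction[:remainder]:
        counts[name] += 1
    return counts


def load_composition(path: str) -> Dict[str, dict]:
    """Read the JSON composition file from disk."""
    with open(path) as handle:
        return json.load(handle)
```

With the example file, `read_counts(load_composition("test/data/compose.json"), 1000)` would allocate 900 human, 50 saureus and 50 kpneumoniae reads.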

esteinig commented 4 years ago

Create a data set for testing:

5% MRSA (ST93, Eike)
5% Klebsiella pneumoniae (GR1220, Miranda)
90% Human (ONT reference, FAB42395)

These were basecalled with different models; for real validation we need to rebasecall them with the same model. This data set is meant for on-the-fly testing instead.

esteinig commented 4 years ago

ArtificialMixture class in nanopath/utils.py needs better logging.

esteinig commented 4 years ago

Shuffled test data set at test/data/compare.fq

esteinig commented 4 years ago

Added test Kraken2 output from test/data/compose.fq in f70bd6cee1bac97aabd4792e59f9629256e25133:

test/data/test.report
test/data/test.reads
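To check a report against the known composition, the species-level percentages can be pulled out of the Kraken2 report. The sketch below assumes the standard six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, taxid, indented name). The function name `species_percentages` is hypothetical.

```python
from typing import Dict, Iterable


def species_percentages(lines: Iterable[str]) -> Dict[str, float]:
    """Map species name to percentage of reads for rank code 'S' rows."""
    result: Dict[str, float] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, rank, name = fields[0], fields[3], fields[5]
        if rank == "S":
            # Names are indented by taxonomic depth, so strip the padding.
            result[name.strip()] = float(percent)
    return result
```

Opening test/data/test.report and passing the file handle to `species_percentages` gives a dictionary that can be compared with the expected 90/5/5 split. Some deviation from those values is expected, since not every read will be classified to species level.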

esteinig commented 4 years ago

Reminder: rebasecall everything with standardized HAC models.