mbhall88 / tbpore

Mycobacterium tuberculosis genomic analysis from Nanopore sequencing data
MIT License

Sort fastq to ensure reproducibility #48

Closed mbhall88 closed 1 year ago

mbhall88 commented 1 year ago

We had an example where running tbpore on a single, combined fastq produced different results to running it on a directory of the individual fastqs that make up the combined one. The difference comes down to subsampling. When we randomly subsample the reads, we set a random seed so that the same reads will always be selected, BUT this is only guaranteed if the reads are in the same order. When we combine the reads on the command line, we get a different ordering to when we combine the reads from a directory with Python.
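
To illustrate why a fixed seed is not enough on its own, here is a minimal sketch using Python's standard `random` module. It is not tbpore's actual subsampling code; the `subsample` helper and the read IDs are hypothetical. With the same seed, the same positions in the input are picked, so a different read order gives a different set of reads.

```python
import random


def subsample(reads, k, seed):
    """Pick k reads using a fixed seed (hypothetical stand-in for the real subsampling step)."""
    rng = random.Random(seed)
    return rng.sample(reads, k)


# The same ten reads in two different orders, e.g. one order from
# combining files on the command line and another from reading a directory.
read_ids = [f"read{i}" for i in range(10)]
combined_order = list(read_ids)
directory_order = list(reversed(read_ids))

picked_combined = subsample(combined_order, 3, seed=1)
picked_directory = subsample(directory_order, 3, seed=1)

# Same seed, same positions chosen, but those positions hold different reads.
print(sorted(picked_combined))
print(sorted(picked_directory))
```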

To avoid this problem in the future, we will sort the reads with seqkit sort (which is already a dependency) before subsampling, so that the input order is the same however the reads were combined.
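
As a rough sketch of the idea (again in plain Python rather than seqkit, with a hypothetical `sorted_subsample` helper and made-up read IDs), sorting the reads into a canonical order before the seeded subsample makes the selection independent of how the input was assembled:

```python
import random


def sorted_subsample(reads, k, seed):
    """Sort reads into a canonical order, then subsample with a fixed seed.

    Hypothetical helper for illustration; sorting here is by read ID string.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(reads), k)


read_ids = [f"read{i}" for i in range(10)]

# Two arbitrary input orders of the same reads.
order_a = list(read_ids)
order_b = list(reversed(read_ids))

picked_a = sorted_subsample(order_a, 3, seed=1)
picked_b = sorted_subsample(order_b, 3, seed=1)

# Both inputs are sorted to the same order first, so the picks are identical.
print(picked_a == picked_b)
```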