We had an example where running tbpore on a single, combined fastq produced different results from running it on a directory of the individual fastqs that make up the combined one. The difference comes down to subsampling. When we randomly subsample the reads, we set a random seed so that the same reads will always be selected, but this is only guaranteed if the reads are in the same order. When we combine the reads on the command line, we get a different ordering to when we combine the reads from a directory with Python.
To avoid this problem in the future, we will sort the reads with seqkit sort (which is already a dependency) before subsampling.
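
The effect can be reproduced with a small Python sketch. This is an illustration only, not tbpore's actual subsampling code: the `subsample` helper, the read IDs, and the use of `random.sample` are all assumptions made for the example. It shows that a fixed seed selects the same positions, so the same reads in a different order give a different subsample, and that sorting first removes the dependence on input order.

```python
import random


def subsample(read_ids, k, seed=1):
    """Hypothetical helper: pick k reads using a fixed seed.

    The seed fixes which *positions* are drawn, so the result
    depends on the order of read_ids.
    """
    rng = random.Random(seed)
    return rng.sample(read_ids, k)


# Made-up read IDs: the same 100 reads, concatenated in two different orders
# (standing in for "combined on the command line" vs "combined from a directory").
combined_cli = [f"read{i:03d}" for i in range(100)]
combined_dir = combined_cli[50:] + combined_cli[:50]

# Same seed, same reads, different order: the selections differ.
unsorted_cli = subsample(combined_cli, 10)
unsorted_dir = subsample(combined_dir, 10)

# Sorting before subsampling makes the input order irrelevant.
sorted_cli = subsample(sorted(combined_cli), 10)
sorted_dir = subsample(sorted(combined_dir), 10)

print("unsorted inputs agree:", unsorted_cli == unsorted_dir)
print("sorted inputs agree:", sorted_cli == sorted_dir)
```

Sorting costs an extra pass over the reads, but it makes the subsample a function of the read set and the seed alone, rather than of how the files happened to be concatenated.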