Open mruffalo opened 2 years ago
Finding grouped FASTQ files is handled in pipelines by https://github.com/hubmapconsortium/fastq-utils, and that should probably be used here too.
Grouped files are identified by the regex at https://github.com/hubmapconsortium/fastq-utils/blob/main/fastq_utils/__init__.py#L15
gz_validator.py in this repo is the validator that currently checks whether fastq.gz files can be decompressed.
Single-cell/nucleus datasets typically contain matched FASTQ files, in groups of 2 for RNA-seq and some ATAC-seq assays, and 3 for other ATAC-seq data types. (RNA-seq data contains 2 more files per group, with prefix I, which are not currently used in the analysis.)
Processing these data types requires the groups of FASTQ files to match, e.g., "the first read in R1 is the barcode + UMI, the first read in R2 is the matched transcript sequence", with "zipped" iteration over the reads in each file.
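To make the "zipped" iteration concrete, here is a minimal sketch of how a consumer might walk two grouped files in lockstep. The function names `read_fastq` and `paired_sequences` are made up for illustration and are not taken from the pipelines or from fastq-utils; the sketch also assumes plain 4-line FASTQ records.

```python
import gzip
from itertools import islice, zip_longest
from typing import Iterator, Tuple

_MISSING = object()  # sentinel marking that one file ran out of reads early


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each 4-line FASTQ record."""
    with gzip.open(path, "rt") as f:
        while True:
            record = list(islice(f, 4))
            if not record:
                return
            if len(record) != 4:
                raise ValueError(f"{path}: truncated FASTQ record")
            header, seq, _plus, qual = (line.rstrip("\n") for line in record)
            yield header, seq, qual


def paired_sequences(r1_path: str, r2_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (R1 sequence, R2 sequence) pairs, read by read.

    Hypothetical helper: in the RNA-seq case described above, the R1
    sequence would be barcode + UMI and the R2 sequence the transcript.
    """
    pairs = zip_longest(read_fastq(r1_path), read_fastq(r2_path), fillvalue=_MISSING)
    for rec1, rec2 in pairs:
        if rec1 is _MISSING or rec2 is _MISSING:
            raise ValueError(f"read counts differ: {r1_path} vs {r2_path}")
        yield rec1[1], rec2[1]
```

With mismatched files this only fails once the shorter file is exhausted, which can be deep into processing.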
This crucially requires the number of reads (and therefore lines) in each of the grouped FASTQ files to match. Check this during dataset ingest -- we already check for valid `gzip` compression, and we should implement the check proposed here in the same pass so that it doesn't waste CPU and I/O time decompressing the same file twice.
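A rough sketch of what a combined check could look like, assuming a simplified stand-in pattern for grouping. The `READ_FILE` regex below is hypothetical and is not the fastq-utils regex linked above, which should be used in a real implementation; `count_lines` and `find_mismatched_groups` are also made-up names. Reading each file to the end both counts lines and exercises gzip decompression, so each file is decompressed only once.

```python
import gzip
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, Tuple

# Hypothetical, simplified grouping pattern (NOT the fastq-utils regex):
# matches names like "sample_R1.fastq.gz" or "sample_S1_L001_R2_001.fastq.gz".
READ_FILE = re.compile(
    r"^(?P<prefix>.+)_(?P<read>R[123])(?P<suffix>(?:_\d+)?)\.(?:fastq|fq)\.gz$"
)

GroupKey = Tuple[str, str, str]


def count_lines(path: Path) -> int:
    """Count lines while decompressing the whole file.

    An invalid or truncated gzip stream raises an exception here, so this
    single pass also covers the decompression check.
    """
    lines = 0
    with gzip.open(path, "rb") as f:
        for _ in f:
            lines += 1
    return lines


def find_mismatched_groups(paths: Iterable[Path]) -> Dict[GroupKey, Dict[str, int]]:
    """Return {group key: {filename: line count}} for groups whose counts differ."""
    groups: Dict[GroupKey, Dict[str, int]] = defaultdict(dict)
    for path in map(Path, paths):
        m = READ_FILE.match(path.name)
        if m is None:
            continue  # not a grouped read file under this simplified pattern
        # Files belong together if they differ only in the R1/R2/R3 part.
        key = (str(path.parent), m["prefix"], m["suffix"])
        groups[key][path.name] = count_lines(path)
    return {
        key: counts
        for key, counts in groups.items()
        if len(set(counts.values())) > 1
    }
```

An empty result means every group found has matching line counts. A fuller version might also check that each count is a multiple of 4.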