hubmapconsortium / ingest-validation-tests

Check matched FASTQ files for same number of reads/lines #14

Open mruffalo opened 2 years ago

mruffalo commented 2 years ago

Single-cell/nucleus datasets typically contain matched FASTQ files: groups of 2 for RNA-seq and some ATAC-seq assays, and groups of 3 for other ATAC-seq data types. (RNA-seq data contains 2 more files per group, with prefix I, which are not currently used in the analysis.)

Processing these data types requires the FASTQ files in each group to match read for read. For example, the first read in R1 is the barcode + UMI and the first read in R2 is the matched transcript sequence, so the pipelines iterate over the reads in each file in "zipped" fashion.

This crucially requires the number of reads (and therefore lines) in each of the grouped FASTQ files to match. We should check this during dataset ingest. We already check for valid gzip compression, so the new check should be implemented alongside that one, counting lines in the same pass, so that we don't waste CPU and I/O time decompressing each file twice.
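
A minimal sketch of the single-pass idea, for discussion only. The function names `count_fastq_lines` and `check_group` are hypothetical and are not taken from this repo or from fastq-utils. Reading each file to the end both counts its lines and exercises gzip integrity, so no file needs to be decompressed twice.

```python
import gzip
from pathlib import Path
from typing import Iterable, List


def count_fastq_lines(path: Path) -> int:
    """Decompress a gzipped FASTQ file once and count its lines.

    Reading to the end also acts as the gzip validity check: a corrupt
    or truncated file raises an exception (e.g. OSError or EOFError).
    """
    lines = 0
    with gzip.open(path, "rb") as f:
        for _ in f:
            lines += 1
    return lines


def check_group(paths: Iterable[Path]) -> List[str]:
    """Return error messages for one group of matched FASTQ files.

    An empty list means every file in the group has the same line count.
    (Hypothetical helper; how groups are found is out of scope here.)
    """
    counts = {path: count_fastq_lines(path) for path in paths}
    if len(set(counts.values())) > 1:
        detail = ", ".join(f"{p.name}: {n}" for p, n in counts.items())
        return [f"Line counts differ within FASTQ group ({detail})"]
    return []
```

For example, `check_group([r1_path, r2_path])` would return an empty list when R1 and R2 have the same number of lines, and a single message listing the per-file counts otherwise.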

mruffalo commented 2 years ago

Finding grouped FASTQ files is handled in pipelines by https://github.com/hubmapconsortium/fastq-utils, and that should probably be used here too.

jswelling commented 2 years ago

Grouped files are identified by the regex at https://github.com/hubmapconsortium/fastq-utils/blob/main/fastq_utils/__init__.py#L15
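
To illustrate what regex-based grouping could look like, here is a rough sketch. The pattern below is a hypothetical stand-in written for this example. It is NOT the regex from fastq-utils, and the real validator should import the fastq-utils logic rather than copy this. The name `group_fastq_files` is likewise an assumption.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# HYPOTHETICAL pattern for names like "sample_S1_L001_R1_001.fastq.gz".
# This is not the fastq-utils regex; it only illustrates the approach.
FASTQ_NAME = re.compile(
    r"^(?P<prefix>.+)_(?P<read>[RI]\d)(?P<suffix>(_\d+)?\.(fastq|fq)(\.gz)?)$"
)

GroupKey = Tuple[Path, str, str]


def group_fastq_files(paths: Iterable[Path]) -> Dict[GroupKey, List[Path]]:
    """Group FASTQ files that differ only in their read label (R1, R2, ...).

    Files with prefix I (index reads) are skipped, since they are not
    currently used in the analysis. Non-FASTQ names are ignored.
    """
    groups: Dict[GroupKey, List[Path]] = defaultdict(list)
    for path in paths:
        match = FASTQ_NAME.match(path.name)
        if match is None or match.group("read").startswith("I"):
            continue
        key = (path.parent, match.group("prefix"), match.group("suffix"))
        groups[key].append(path)
    return {key: sorted(files) for key, files in groups.items()}
```

Each value in the returned dictionary is one group of files whose line counts would then be compared.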

jswelling commented 2 years ago

gz_validator.py in this repo is the test that currently checks whether fastq.gz files can be decompressed.