uclahs-cds / package-PipeVal

An easy-to-use CLI tool for validating different parameters in your NF script/pipeline.
GNU General Public License v2.0

Compression integrity checks #93

Closed yashpatel6 closed 10 months ago

yashpatel6 commented 10 months ago

Description

Adds an integrity check option for compressed files.
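
For context, an integrity check on a compressed file usually means decompressing the whole stream without writing any output, so that a truncated or corrupted file is caught early. The sketch below shows the general idea with the standard `gzip -t` and `bzip2 -t` test flags. The function name `check_integrity` is hypothetical, and this is not necessarily how pipeval implements its check.

```shell
#!/bin/bash
# Hypothetical helper (not part of pipeval): test a compressed file by
# decompressing it to nowhere with the tool that matches its extension.
check_integrity() {
    local file="$1"
    case "$file" in
        *.gz)  gzip -t "$file" 2>/dev/null ;;   # -t: test integrity, write no output
        *.bz2) bzip2 -t "$file" 2>/dev/null ;;  # -t: the equivalent test flag for bzip2
        *)     return 0 ;;                      # not a compressed file: nothing to test
    esac
}
```

The function returns the exit status of the underlying tool, so it can be used directly in a condition, for example `check_integrity reads.fastq.gz || echo "corrupt"` (the file name here is only illustrative).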


Test Results

Tested with:

#!/bin/bash
# Manual test run: each echo labels the case exercised by the pipeval command that follows.
echo "empty BAM"
pipeval validate test_files/empty_bam.bam

printf "\n"

echo "invalid BAM"
pipeval validate test_files/invalid.bam

printf "\n"

echo "pass BAM"
pipeval validate test_files/pass.bam

printf "\n"

echo "BAM with no index"
pipeval validate test_files/noindex.bam

printf "\n"

echo "Just text file"
pipeval validate test_files/hello.txt

printf "\n"

echo "Failing checksum MD5"
pipeval validate test_files/hello_bad_md5.txt

echo "Failing checksum SHA512"
pipeval validate test_files/hello_bad_sha512.txt

printf "\n"

echo "Generate md5 checksum"
pipeval generate-checksum -t md5 test_files/togen.txt

echo "Generate sha512 checksum"
pipeval generate-checksum -t sha512 test_files/togen.txt

echo "Validate generated checksums"
pipeval validate test_files/togen.txt

# Clean up the checksum files generated above
rm test_files/togen.txt.md5
rm test_files/togen.txt.sha512

printf "\n"

echo "Valid VCF"
pipeval validate test_files/test_vcf.vcf.gz

printf "\n"

echo "Valid CRAM"
pipeval validate test_files/valid.cram -r /hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta

printf "\n"

echo "CRAM with no index"
pipeval validate test_files/noindex.cram -r /hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta

printf "\n"

echo "CRAM with default reference"
pipeval validate test_files/default_ref.cram

printf "\n"

echo "Invalid CRAM"
pipeval validate test_files/invalid.cram -r /hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta

printf "\n"

echo "Valid SAM"
pipeval validate test_files/valid.sam

printf "\n"

echo "Valid FASTQ"
pipeval validate test.fq.gz -t

printf "\n"

echo "Truncated FASTQ"
pipeval validate test_13.fastq

printf "\n"

echo "Bad record FASTQ"
pipeval validate test_invalid.fastq.bz2 -t
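
The checksum cases in the script (generate a checksum, then validate the file against it) follow the common sidecar pattern: the digest is stored next to the file as `<file>.md5` and recomputed on validation. A minimal sketch of that pattern with GNU coreutils `md5sum` follows. The helper names `write_md5` and `verify_md5` are hypothetical, and the sidecar layout in `md5sum` format is an assumption rather than pipeval's documented format.

```shell
#!/bin/bash
# Hypothetical helpers (not pipeval's implementation). Assumes GNU coreutils
# md5sum and a sidecar file named <file>.md5 in md5sum's own format.

write_md5() {
    # Run inside the file's directory so the sidecar records a bare file name.
    ( cd "$(dirname "$1")" && md5sum "$(basename "$1")" > "$(basename "$1").md5" )
}

verify_md5() {
    # --status suppresses output; the exit code alone reports match or mismatch.
    ( cd "$(dirname "$1")" && md5sum --status -c "$(basename "$1").md5" )
}
```

Swapping `md5sum` for `sha512sum` gives the SHA512 variant, since both tools accept the same `-c` and `--status` options.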
