uclahs-cds / package-PipeVal

An easy to use CLI tool that can be used to validate different parameters in your NF script/pipeline.
GNU General Public License v2.0

Update auto file-type detection #64

Closed madisonjordan closed 1 year ago

madisonjordan commented 1 year ago

Checklist

Formatting

File Updates

Docker Hub Auto Build Rules

Docker Image Testing

Test the Docker image with at least one sample. Verify the new Docker image works using:

docker run -u $(id -u):$(id -g) -w <working-directory> -v <directory-you-want-to-mount>:<how-you-want-to-mount-it-within-the-docker> --rm <docker-image-name> <command-to-the-docker-with-all-parameters>

Description

Closes #...

Testing Results

Note: the checksum import and the checksum function were commented out of the validate_file function during testing.

Nextflow Test

So far, VCFs have been tested with Nextflow.

built image from this branch using:

docker build -t pipeval:nf -f docker/Dockerfile .

nextflow command:

nextflow run ./nextflow/main.nf -with-docker pipeval:nf

nextflow scripts located here:

/hot/user/mbjordan/GitHub/public-tool-PipeVal/mbjordan-nf-module/nextflow

nf command logs:

/hot/user/mbjordan/GitHub/public-tool-PipeVal/mbjordan-nf-module/work

Docker

VCF

auto detect check:

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$ 
validate /hot/software/pipeline/pipeline-call-MutationalSignature/Nextflow/development/input/data/_tests/ARCT/vcf/MSK-AB-0002-T14-1.vcf 
/tool/validate/files.py:9: UserWarning: Warning: file /hot/software/pipeline/pipeline-call-MutationalSignature/Nextflow/development/input/data/_tests/ARCT/vcf/MSK-AB-0002-T14-1.vcf is not zipped.
  warnings.warn(f'Warning: file {path} is not zipped.')
The header tag 'reference' not present. (Not required but highly recommended.)
INFO field at chr1
...
INFO field at chrX:153954457 .. INFO tag [AS_SB_TABLE=618,455|7,6] expected different number of values (1)
INFO field at chrX:154361480 .. INFO tag [AS_SB_TABLE=851,658|10,8] expected different number of values (1)
INFO field at chrX:154485887 .. INFO tag [AS_SB_TABLE=589,594|9,8] expected different number of values (1)
Input: /hot/software/pipeline/pipeline-call-MutationalSignature/Nextflow/development/input/data/_tests/ARCT/vcf/MSK-AB-0002-T14-1.vcf is valid file-vcf

backwards compatible check using file-vcf

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$ 
validate -t file-vcf /hot/software/pipeline/pipeline-call-MutationalSignature/Nextflow/development/input/data/_tests/ARCT/vcf/MSK-AB-0002-T14-1.vcf
...
Input: /hot/software/pipeline/pipeline-call-MutationalSignature/Nextflow/development/input/data/_tests/ARCT/vcf/MSK-AB-0002-T14-1.vcf is valid file-vcf
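
The "is not zipped" message in the auto-detect output is a warning rather than an error, so validation carries on and the file is still reported as valid. A minimal sketch of that warn-but-continue pattern follows. The function name check_compression and the suffix set are assumptions for illustration; only the warning text is taken from the log.

```python
import warnings
from pathlib import Path

# Assumed set of suffixes treated as "zipped"; the real list used by
# files.py is not shown in this thread.
COMPRESSED_SUFFIXES = {'.gz', '.bz2', '.zip'}


def check_compression(path):
    """Warn (without failing) when a file does not look compressed.

    Returns True if the final suffix is in COMPRESSED_SUFFIXES, else False.
    """
    compressed = Path(path).suffix.lower() in COMPRESSED_SUFFIXES
    if not compressed:
        # Same message format as the UserWarning shown in the log.
        warnings.warn(f'Warning: file {path} is not zipped.')
    return compressed
```

Because the function only warns, a caller can go on to the format-specific check regardless of the result.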

BAM

autodetect bam check (valid):

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$ 
validate ./bam_pass/CPCG0196-F1-A-mini-0-RNA.bam
Input: bam_pass/CPCG0196-F1-A-mini-0-RNA.bam is valid file-bam

autodetect bam check (invalid - empty):

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$
validate ./bam_fail_empty/HG002_N_A-null.bam
Error: bam_fail_empty/HG002_N_A-null.bam pysam bam check failed. No reads in bam_fail_empty/HG002_N_A-null.bam

autodetect bam check (invalid):

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$ 
validate ./bam_fail_invalid/invalid.bam     
Error: bam_fail_invalid/invalid.bam samtools bam check failed. 'samtools returned with error 4: stdout=, stderr=bam_fail_invalid/invalid.bam was not identified as sequence data.\n'

autodetect bam check (warning on missing index):

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$ 
validate ./bam_warn_index_missing/CPCG0196-F1-A-mini-0-RNA.bam 
Warning: bam_warn_index_missing/CPCG0196-F1-A-mini-0-RNA.bam pysam bam index check failed. Index file for bam_warn_index_missing/CPCG0196-F1-A-mini-0-RNA.bamcould not be opened or does not exist.
Input: bam_warn_index_missing/CPCG0196-F1-A-mini-0-RNA.bam is valid file-bam

backwards compatible check (of valid file):

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$ 
validate -t file-bam ./bam_pass/CPCG0196-F1-A-mini-0-RNA.bam
Input: bam_pass/CPCG0196-F1-A-mini-0-RNA.bam is valid file-bam

backwards compatible check (of invalid file):

I have no name!@6937253e22be:/hot/user/abeshlikyan/pipeval_testing_set$ 
validate -t file-bam bam_fail_invalid/invalid.bam
Error: bam_fail_invalid/invalid.bam samtools bam check failed. 'samtools returned with error 4: stdout=, stderr=bam_fail_invalid/invalid.bam was not identified as sequence data.\n'
madisonjordan commented 1 year ago

I did this because I don't like having to specify the file type in Nextflow and was dreading it for RecSNV. It also appeared that the file type was being detected anyway, so specifying the type manually seemed redundant to me.
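
For illustration only, here is one way extension-based auto-detection with an optional explicit type could look. This is a sketch, not the code in this branch: the function name detect_file_type, the extension mapping, the compression suffix list, and the generic 'file' fallback are all assumptions. Only the type strings file-vcf and file-bam appear in the test output above.

```python
from pathlib import Path

# Hypothetical mapping from extension to type string. Only 'file-vcf' and
# 'file-bam' are taken from the test logs; the mapping itself is assumed.
EXTENSION_TO_TYPE = {
    '.vcf': 'file-vcf',
    '.bam': 'file-bam',
}

# Assumed compression suffixes to skip when looking for the real extension.
COMPRESSION_SUFFIXES = {'.gz', '.bz2'}


def detect_file_type(path, requested_type=None):
    """Return the type to validate against.

    An explicit type (the backwards-compatible -t usage) wins; otherwise
    the type is guessed from the file extension.
    """
    if requested_type is not None:
        return requested_type

    suffixes = [s.lower() for s in Path(path).suffixes]
    # Drop trailing compression suffixes so 'x.vcf.gz' resolves to '.vcf'.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()

    if suffixes and suffixes[-1] in EXTENSION_TO_TYPE:
        return EXTENSION_TO_TYPE[suffixes[-1]]
    return 'file'  # hypothetical generic fallback
```

With this shape, passing a type explicitly behaves as before, and omitting it falls back to the guess.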

madisonjordan commented 1 year ago

Added a few comments. From a Nextflow perspective, we'll still need to maintain the type argument in case users are validating files vs directories. It can now be generic rather than naming the specific file type, but the process module will still need two inputs: the path to the file/directory to be validated and the type (either file, directory-r, or directory-rw for now).

I was thinking it would be easier in the Nextflow module to use a Java check to see whether directories are readable/writable (isReadable and isWritable) and to use PipeVal only for file-type checking. I imagine checking whether a directory/file is readable or writable is a common use case even outside of PipeVal, where running the Docker image might not be necessary.

And/or determine whether it's a file or directory using isFile or isDirectory, and pick the behavior from that instead of choosing between directory and file options.
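
A rough Python analogue of the checks proposed above, for comparison only. The comment refers to Java methods (presumably those on java.nio.file.Files), whereas this sketch uses pathlib and os.access. The function name describe_path and the returned keys are hypothetical.

```python
import os
from pathlib import Path


def describe_path(path):
    """Classify a path and report read/write access for the current user."""
    p = Path(path)
    if p.is_dir():
        kind = 'directory'
    elif p.is_file():
        kind = 'file'
    else:
        kind = 'missing-or-other'
    return {
        'kind': kind,
        # os.access returns False for paths that do not exist.
        'readable': os.access(p, os.R_OK),
        'writable': os.access(p, os.W_OK),
    }
```

A caller could branch on 'kind' to choose between file-type validation and a plain permission check, which is the behavior switch the comment suggests.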

yashpatel6 commented 1 year ago

I'm not against this; we can discuss it as the NF WG. The parameter validation module already handles directory permissions, so I think it should be fine to remove directory checking from PipeVal.