jkimlab / PAPipe

PAPipe fails when Reference dbSNP file is not provided #1

Open diegovalenzuelam opened 5 months ago

diegovalenzuelam commented 5 months ago

Hello!

Thanks for your amazing tool. I was trying to run PAPipe with my own data, but the analysis kept crashing at the variant calling step. While trying to figure out the problem, I realized that the pipeline fails when the reference dbSNP file is not provided. I tested this with the "test data" by deleting the cow.chr1.dbsnp.vcf.gz file and changing line 47 from "DBSNP = /RUN_DOCKER/data/ref/cow.chr1.dbsnp.vcf.gz" to "DBSNP = /RUN_DOCKER/data/ref/" in the main_param.txt file.

I'm attaching the dbsnp.idex.log:

00:16:28.344 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Apr 24, 2024 12:16:28 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
00:16:28.436 INFO IndexFeatureFile - ------------------------------------------------------------
00:16:28.436 INFO IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.1.7.0
00:16:28.436 INFO IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/
00:16:28.436 INFO IndexFeatureFile - Executing as root@42c66c745761 on Linux v6.5.0-27-generic amd64
00:16:28.436 INFO IndexFeatureFile - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
00:16:28.436 INFO IndexFeatureFile - Start Date/Time: April 24, 2024 12:16:28 AM MSK
00:16:28.436 INFO IndexFeatureFile - ------------------------------------------------------------
00:16:28.436 INFO IndexFeatureFile - ------------------------------------------------------------
00:16:28.436 INFO IndexFeatureFile - HTSJDK Version: 2.21.2
00:16:28.436 INFO IndexFeatureFile - Picard Version: 2.21.9
00:16:28.436 INFO IndexFeatureFile - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:16:28.436 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:16:28.436 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:16:28.436 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:16:28.436 INFO IndexFeatureFile - Deflater: IntelDeflater
00:16:28.436 INFO IndexFeatureFile - Inflater: IntelInflater
00:16:28.436 INFO IndexFeatureFile - GCS max retries/reopens: 20
00:16:28.436 INFO IndexFeatureFile - Requester pays: disabled
00:16:28.436 INFO IndexFeatureFile - Initializing engine
00:16:28.436 INFO IndexFeatureFile - Done initializing engine
00:16:28.577 INFO IndexFeatureFile - Shutting down engine
[April 24, 2024 12:16:28 AM MSK] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2129133568


A USER ERROR has occurred: Cannot read file:///RUN_DOCKER/out/02_VariantCalling/REF/ref/ because no suitable codecs found


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /opt/conda/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar IndexFeatureFile -I /RUN_DOCKER/out/02_VariantCalling/REF/ref

Regards,

Diego

nayoung9 commented 5 months ago

Thank you for reaching out, and I'm glad to hear that you find PAPipe useful. Based on your description and testing with the test data, the pipeline fails when the reference dbSNP file is not provided.

If you do not need to use a reference dbSNP file for your analysis, you can remove the whole DBSNP parameter line from the main.param.txt file. By doing so, the pipeline will not use a dbSNP file and will ignore that configuration part.
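For anyone hitting the same thing, the three cases discussed so far (DBSNP line removed, DBSNP pointing at a directory, DBSNP pointing at a real file) can be told apart before launching the pipeline. The following is only a minimal sketch, not part of PAPipe: the function names `read_params` and `dbsnp_status` are hypothetical, and it assumes the parameter file consists of `KEY = value` lines as in the DBSNP line quoted above, with `#` treated as a comment marker (an assumption).

```python
import os
import tempfile


def read_params(path):
    """Parse 'KEY = value' lines into a dict.

    Assumption: blank lines and lines starting with '#' can be ignored.
    """
    params = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            params[key.strip()] = value.strip()
    return params


def dbsnp_status(params):
    """Classify the DBSNP setting as 'absent', 'not_a_file', or 'ok'."""
    value = params.get("DBSNP", "")
    if not value:
        return "absent"        # line removed or left empty
    if not os.path.isfile(value):
        return "not_a_file"    # e.g. a directory such as /RUN_DOCKER/data/ref/
    return "ok"


if __name__ == "__main__":
    # Reproduce the situation from the report: DBSNP points at a directory.
    workdir = tempfile.mkdtemp()
    param_file = os.path.join(workdir, "main_param.txt")
    with open(param_file, "w") as handle:
        handle.write("DBSNP = " + workdir + "/\n")
    print(dbsnp_status(read_params(param_file)))
```

A "not_a_file" result matches the first log above, where IndexFeatureFile was handed a directory and found no suitable codec for it.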

diegovalenzuelam commented 5 months ago

Thanks Nayoung for your fast answer. I changed the main_param.txt file as you suggested, but I'm still having some issues using the test data. The problem now seems to be with the Base Recalibration step, since one of the mandatory arguments to run BaseRecalibrator is a database of known polymorphic sites (--known-sites:FeatureInput).

Here is a part of the log file showing the error:


A USER ERROR has occurred: Invalid argument '/RUN_DOCKER/out/02_VariantCalling/01.BaseRecalibration/Angus_Angus4.table'.


and the full output:

USAGE: BaseRecalibrator [arguments]

First pass of the Base Quality Score Recalibration (BQSR) -- Generates recalibration table based on various user-specified covariates (such as read group, reported quality score, machine cycle, and nucleotide context). Version:4.1.7.0

Required Arguments:

--input,-I:String BAM/SAM/CRAM file containing reads This argument must be specified at least once. Required.

--known-sites:FeatureInput One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis. This argument must be specified at least once. Required.

--output,-O:File The output recalibration table file to create Required.

--reference,-R:GATKPathSpecifier Reference sequence file Required.

Optional Arguments:

--add-output-sam-program-record,-add-output-sam-program-record:Boolean If true, adds a PG tag to created SAM/BAM/CRAM files. Default value: true. Possible values: {true, false}

--add-output-vcf-command-line,-add-output-vcf-command-line:Boolean If true, adds a command line header line to created VCF files. Default value: true. Possible values: {true, false}

--arguments_file:File read one or more arguments files and add them to the command line This argument may be specified 0 or more times. Default value: null.

--binary-tag-name:String the binary tag covariate name if using it Default value: null.

--bqsr-baq-gap-open-penalty:Double BQSR BAQ gap open penalty (Phred Scaled). Default value is 40. 30 is perhaps better for whole genome call sets Default value: 40.0.

--cloud-index-prefetch-buffer,-CIPB:Integer Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset. Default value: -1.

--cloud-prefetch-buffer,-CPB:Integer Size of the cloud-only prefetch buffer (in MB; 0 to disable). Default value: 40.

--create-output-bam-index,-OBI:Boolean If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file. Default value: true. Possible values: {true, false}

--create-output-bam-md5,-OBM:Boolean If true, create a MD5 digest for any BAM/SAM/CRAM file created Default value: false. Possible values: {true, false}

--create-output-variant-index,-OVI:Boolean If true, create a VCF index when writing a coordinate-sorted VCF file. Default value: true. Possible values: {true, false}

--create-output-variant-md5,-OVM:Boolean If true, create a a MD5 digest any VCF file created. Default value: false. Possible values: {true, false}

--default-base-qualities:Byte Assign a default base quality Default value: -1.

--deletions-default-quality:Byte default quality for the base deletions covariate Default value: 45.

--disable-bam-index-caching,-DBIC:Boolean If true, don't cache bam indexes, this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified. Default value: false. Possible values: {true, false}

--disable-read-filter,-DF:String Read filters to be disabled before analysis This argument may be specified 0 or more times. Default value: null. Possible Values: {MappedReadFilter, MappingQualityAvailableReadFilter, MappingQualityNotZeroReadFilter, NotDuplicateReadFilter, NotSecondaryAlignmentReadFilter, PassesVendorQualityCheckReadFilter, WellformedReadFilter}

--disable-sequence-dictionary-validation,-disable-sequence-dictionary-validation:Boolean If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk! Default value: false. Possible values: {true, false}

--exclude-intervals,-XL:String One or more genomic intervals to exclude from processing This argument may be specified 0 or more times. Default value: null.

--gatk-config-file:String A configuration file to use with the GATK. Default value: null.

--gcs-max-retries,-gcs-retries:Integer If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection Default value: 20.

--gcs-project-for-requester-pays:String Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. Default value: .

--help,-h:Boolean display the help message Default value: false. Possible values: {true, false}

--indels-context-size,-ics:Integer Size of the k-mer context to be used for base insertions and deletions Default value: 3.

--insertions-default-quality:Byte default quality for the base insertions covariate Default value: 45.

--interval-exclusion-padding,-ixp:Integer Amount of padding (in bp) to add to each interval you are excluding. Default value: 0.

--interval-merging-rule,-imr:IntervalMergingRule Interval merging rule for abutting intervals Default value: ALL. Possible values: {ALL, OVERLAPPING_ONLY}

--interval-padding,-ip:Integer Amount of padding (in bp) to add to each interval you are including. Default value: 0.

--interval-set-rule,-isr:IntervalSetRule Set merging approach to use for combining interval inputs Default value: UNION. Possible values: {UNION, INTERSECTION}

--intervals,-L:String One or more genomic intervals over which to operate This argument may be specified 0 or more times. Default value: null.

--lenient,-LE:Boolean Lenient processing of VCF files Default value: false. Possible values: {true, false}

--low-quality-tail:Byte minimum quality for the bases in the tail of the reads to be considered Default value: 2.

--maximum-cycle-value,-max-cycle:Integer The maximum cycle value permitted for the Cycle covariate Default value: 500.

--mismatches-context-size,-mcs:Integer Size of the k-mer context to be used for base mismatches Default value: 2.

--mismatches-default-quality:Byte default quality for the base mismatches covariate Default value: -1.

--preserve-qscores-less-than:Integer Don't recalibrate bases with quality scores less than this threshold (with -bqsr) Default value: 6.

--quantizing-levels:Integer number of distinct quality scores in the quantized output Default value: 16.

--QUIET:Boolean Whether to suppress job-summary info on System.err. Default value: false. Possible values: {true, false}

--read-filter,-RF:String Read filters to be applied before analysis This argument may be specified 0 or more times. Default value: null. Possible Values: {AlignmentAgreesWithHeaderReadFilter, AllowAllReadsReadFilter, AmbiguousBaseReadFilter, CigarContainsNoNOperator, FirstOfPairReadFilter, FragmentLengthReadFilter, GoodCigarReadFilter, HasReadGroupReadFilter, IntervalOverlapReadFilter, LibraryReadFilter, MappedReadFilter, MappingQualityAvailableReadFilter, MappingQualityNotZeroReadFilter, MappingQualityReadFilter, MatchingBasesAndQualsReadFilter, MateDifferentStrandReadFilter, MateDistantReadFilter, MateOnSameContigOrNoMappedMateReadFilter, MateUnmappedAndUnmappedReadFilter, MetricsReadFilter, NonChimericOriginalAlignmentReadFilter, NonZeroFragmentLengthReadFilter, NonZeroReferenceLengthAlignmentReadFilter, NotDuplicateReadFilter, NotOpticalDuplicateReadFilter, NotProperlyPairedReadFilter, NotSecondaryAlignmentReadFilter, NotSupplementaryAlignmentReadFilter, OverclippedReadFilter, PairedReadFilter, PassesVendorQualityCheckReadFilter, PlatformReadFilter, PlatformUnitReadFilter, PrimaryLineReadFilter, ProperlyPairedReadFilter, ReadGroupBlackListReadFilter, ReadGroupReadFilter, ReadLengthEqualsCigarLengthReadFilter, ReadLengthReadFilter, ReadNameReadFilter, ReadStrandFilter, SampleReadFilter, SecondOfPairReadFilter, SeqIsStoredReadFilter, SoftClippedReadFilter, ValidAlignmentEndReadFilter, ValidAlignmentStartReadFilter, WellformedReadFilter}

--read-index,-read-index:String Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically. This argument may be specified 0 or more times. Default value: null.

--read-validation-stringency,-VS:ValidationStringency Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: SILENT. Possible values: {STRICT, LENIENT, SILENT}

--seconds-between-progress-updates,-seconds-between-progress-updates:Double Output traversal statistics every time this many seconds elapse Default value: 10.0.

--sequence-dictionary,-sequence-dictionary:String Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file. Default value: null.

--sites-only-vcf-output:Boolean If true, don't emit genotype fields when writing vcf file output. Default value: false. Possible values: {true, false}

--tmp-dir:GATKPathSpecifier Temp directory to use. Default value: null.

--use-jdk-deflater,-jdk-deflater:Boolean Whether to use the JdkDeflater (as opposed to IntelDeflater) Default value: false. Possible values: {true, false}

--use-jdk-inflater,-jdk-inflater:Boolean Whether to use the JdkInflater (as opposed to IntelInflater) Default value: false. Possible values: {true, false}

--use-original-qualities,-OQ:Boolean Use the base quality scores from the OQ tag Default value: false. Possible values: {true, false}

--verbosity,-verbosity:LogLevel Control verbosity of logging. Default value: INFO. Possible values: {ERROR, WARNING, INFO, DEBUG}

--version:Boolean display the version number for this tool Default value: false. Possible values: {true, false}

Advanced Arguments:

--disable-tool-default-read-filters,-disable-tool-default-read-filters:Boolean Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on) Default value: false. Possible values: {true, false}

--showHidden,-showHidden:Boolean display hidden arguments Default value: false. Possible values: {true, false}

Conditional Arguments for readFilter:

Valid only if "AmbiguousBaseReadFilter" is specified:

--ambig-filter-bases:Integer Threshold number of ambiguous bases. If null, uses threshold fraction; otherwise, overrides threshold fraction. Default value: null. Cannot be used in conjuction with argument(s) maxAmbiguousBaseFraction

--ambig-filter-frac:Double Threshold fraction of ambiguous bases Default value: 0.05. Cannot be used in conjuction with argument(s) maxAmbiguousBases

Valid only if "FragmentLengthReadFilter" is specified:

--max-fragment-length:Integer Maximum length of fragment (insert size) Default value: 1000000.

--min-fragment-length:Integer Minimum length of fragment (insert size) Default value: 0.

Valid only if "IntervalOverlapReadFilter" is specified:

--keep-intervals:String One or more genomic intervals to keep This argument must be specified at least once. Required.

Valid only if "LibraryReadFilter" is specified:

--library,-library:String Name of the library to keep This argument must be specified at least once. Required.

Valid only if "MappingQualityReadFilter" is specified:

--maximum-mapping-quality:Integer Maximum mapping quality to keep (inclusive) Default value: null.

--minimum-mapping-quality:Integer Minimum mapping quality to keep (inclusive) Default value: 10.

Valid only if "MateDistantReadFilter" is specified:

--mate-too-distant-length:Integer Minimum start location difference at which mapped mates are considered distant Default value: 1000.

Valid only if "OverclippedReadFilter" is specified:

--dont-require-soft-clips-both-ends:Boolean Allow a read to be filtered out based on having only 1 soft-clipped block. By default, both ends must have a soft-clipped block, setting this flag requires only 1 soft-clipped block Default value: false. Possible values: {true, false}

--filter-too-short:Integer Minimum number of aligned bases Default value: 30.

Valid only if "PlatformReadFilter" is specified:

--platform-filter-name:String Platform attribute (PL) to match This argument must be specified at least once. Required.

Valid only if "PlatformUnitReadFilter" is specified:

--black-listed-lanes:String Platform unit (PU) to filter out This argument must be specified at least once. Required.

Valid only if "ReadGroupBlackListReadFilter" is specified:

--read-group-black-list:String A read group filter expression in the form "attribute:value", where "attribute" is a two character read group attribute such as "RG" or "PU". This argument must be specified at least once. Required.

Valid only if "ReadGroupReadFilter" is specified:

--keep-read-group:String The name of the read group to keep Required.

Valid only if "ReadLengthReadFilter" is specified:

--max-read-length:Integer Keep only reads with length at most equal to the specified value Required.

--min-read-length:Integer Keep only reads with length at least equal to the specified value Default value: 1.

Valid only if "ReadNameReadFilter" is specified:

--read-name:String Keep only reads with this read name Required.

Valid only if "ReadStrandFilter" is specified:

--keep-reverse-strand-only:Boolean Keep only reads on the reverse strand Required. Possible values: {true, false}

Valid only if "SampleReadFilter" is specified:

--sample,-sample:String The name of the sample(s) to keep, filtering out all others This argument must be specified at least once. Required.

Valid only if "SoftClippedReadFilter" is specified:

--invert-soft-clip-ratio-filter:Boolean Inverts the results from this filter, causing all variants that would pass to fail and visa-versa. Default value: false. Possible values: {true, false}

--soft-clipped-leading-trailing-ratio:Double Threshold ratio of soft clipped bases (leading / trailing the cigar string) to total bases in read for read to be filtered. Default value: null. Cannot be used in conjuction with argument(s) minimumSoftClippedRatio

--soft-clipped-ratio-threshold:Double Threshold ratio of soft clipped bases (anywhere in the cigar string) to total bases in read for read to be filtered. Default value: null. Cannot be used in conjuction with argument(s) minimumLeadingTrailingSoftClippedRatio


A USER ERROR has occurred: Invalid argument '/RUN_DOCKER/out/02_VariantCalling/01.BaseRecalibration/Angus_Angus4.table'.


org.broadinstitute.barclay.argparser.CommandLineException: Invalid argument '/RUN_DOCKER/out/02_VariantCalling/01.BaseRecalibration/Angus_Angus4.table'.
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.setPositionalArgument(CommandLineArgumentParser.java:600)
    at org.broadinstitute.barclay.argparser.CommandLineArgumentParser.parseArguments(CommandLineArgumentParser.java:432)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.parseArgs(CommandLineProgram.java:232)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)
Using GATK jar /opt/conda/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /opt/conda/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar BaseRecalibrator -I /RUN_DOCKER/out/01_ReadMapping/04.ReadRegrouping/Angus_Angus4.addRG.marked.sort.bam -R /RUN_DOCKER/out/02_VariantCalling/REF/cow.chr1.fa --known-sites -O /RUN_DOCKER/out/02_VariantCalling/01.BaseRecalibration/Angus_Angus4.table
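
Note on the command at the end of the log: `--known-sites` is immediately followed by `-O`, with no path in between. That would seem to explain why the parser ends up treating the .table path as a stray positional argument. One way to guard against this when assembling the call is sketched below. This is a hypothetical helper (`base_recalibrator_args` is not a PAPipe function), and it assumes GATK is invoked through the `gatk` wrapper; the flags themselves are the ones listed as required in the usage output above.

```python
def base_recalibrator_args(bam, reference, table, known_sites):
    """Build a BaseRecalibrator argument list, refusing to emit a bare --known-sites.

    known_sites is a list of VCF paths; empty or blank entries are dropped.
    """
    sites = [site for site in known_sites if site and site.strip()]
    if not sites:
        # --known-sites is listed as Required, so fail early with a clear
        # message instead of emitting the flag without a value.
        raise ValueError("BaseRecalibrator needs at least one --known-sites file")

    args = ["gatk", "BaseRecalibrator", "-I", bam, "-R", reference]
    for site in sites:
        args += ["--known-sites", site]
    args += ["-O", table]
    return args


if __name__ == "__main__":
    # Paths below are placeholders for illustration only.
    print(" ".join(base_recalibrator_args(
        "sample.bam", "ref.fa", "sample.table", ["known.vcf.gz"])))
```

Whether PAPipe can skip Base Recalibration entirely when no dbSNP file is available is not answered in this thread, so the sketch only fails early rather than silently dropping the step.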