Open Shaun-Regenbaum opened 10 months ago
I am going to do some more exploration of this, and hopefully submit a PR with a fix this week.
I have a working fork in which I think I have fixed the issue. In short, the issue arises when the exome.bed file contains non-standard or unplaced chromosomal sequences, which happens quite often in non-human genomes, for example:
chrUn_GJ060129v1 3730 4217
chrUn_GJ060129v1 5192 5333
chrUn_GJ060129v1 5806 6353
chrUn_GJ060163v1 0 311
chrUn_GJ060163v1 741 1129
My fix was to add a workflow step that filters the exome.bed file down to the chromosomes defined in the genome.dict file. It shouldn't affect other pipelines and should just allow the pipeline to handle a greater variety of reference genomes/species.
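For anyone who wants to try the idea outside the pipeline, here is a minimal Python sketch of that kind of filter. It is not the code from the fork. The function names (`contigs_from_dict`, `filter_bed`, `filter_bed_file`) are made up for illustration, and it assumes that genome.dict is a plain-text sequence dictionary with tab-separated `@SQ` lines carrying `SN:` tags and that the first column of exome.bed is the contig name.

```python
from typing import Iterable, Iterator, Set


def contigs_from_dict(dict_lines: Iterable[str]) -> Set[str]:
    """Collect contig names from the @SQ records of a sequence dictionary."""
    names: Set[str] = set()
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header records
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def filter_bed(bed_lines: Iterable[str], contigs: Set[str]) -> Iterator[str]:
    """Yield only the BED records whose first column is a known contig."""
    for line in bed_lines:
        fields = line.split()
        if not fields:
            continue  # drop blank lines
        if fields[0].startswith("#") or fields[0] in ("track", "browser"):
            yield line  # pass comment/header lines through unchanged
        elif fields[0] in contigs:
            yield line


def filter_bed_file(bed_path: str, dict_path: str, out_path: str) -> None:
    """Hypothetical file-based wrapper around the two helpers above."""
    with open(dict_path) as handle:
        contigs = contigs_from_dict(handle)
    with open(bed_path) as bed, open(out_path, "w") as out:
        out.writelines(filter_bed(bed, contigs))
```

Calling `filter_bed_file("exome.bed", "genome.dict", "exome.filtered.bed")` would write a BED file containing only intervals on contigs present in the dictionary. The output file name is just an example.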
I love this idea, that's an amazing addition
I am using your version with all default parameters and got this error:
Command error:
Using GATK jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx30g -jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar BedToIntervalList --INPUT exome.bed --OUTPUT genome.bed.interval_list --SEQUENCE_DICTIONARY genome.dict --TMP_DIR .
14:48:55.685 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Jul 17 14:48:55 GMT 2024] BedToIntervalList --INPUT exome.bed --SEQUENCE_DICTIONARY genome.dict --OUTPUT genome.bed.interval_list --TMP_DIR . --SORT true --UNIQUE false --DROP_MISSING_CONTIGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
I would be glad if you can help.
Description of the bug
The GATK4_BEDTOINTERVALLIST step sometimes fails when using a variety of reference genomes, due to the incorrect creation of the genome.dict or exome.bed file from the reference GTF files. This results in a sequence dictionary mismatch between the two, which causes the step to fail.
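A quick way to check whether a given reference is affected is to list the contigs that appear in exome.bed but not in genome.dict. The sketch below is only a diagnostic aid under the same assumptions as the file formats themselves (tab-separated `@SQ` lines with `SN:` tags in the dictionary, contig name in the first BED column). The function name `missing_contigs` is hypothetical and not part of the pipeline.

```python
from collections import Counter
from typing import Iterable


def missing_contigs(bed_lines: Iterable[str], dict_lines: Iterable[str]) -> Counter:
    """Count BED records per contig that has no @SQ entry in the dictionary."""
    known = set()
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    known.add(field[3:])

    missing: Counter = Counter()
    for line in bed_lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # ignore blank and comment lines
        if fields[0] not in known:
            missing[fields[0]] += 1
    return missing
```

An empty result means every BED contig is defined in the dictionary. Any entries returned are the contigs (with their interval counts) that would trigger the mismatch.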
Command used and terminal output
No response
Relevant files
No response
System information
No response