nf-core / rnavar

gatk4 RNA variant calling pipeline
https://nf-co.re/rnavar
MIT License

Failure on GATK4_BEDTOINTERVALLIST due to incorrect exome.bed generation from iGenomes References. #112

Open Shaun-Regenbaum opened 10 months ago

Shaun-Regenbaum commented 10 months ago

Description of the bug

The GATK4_BEDTOINTERVALLIST step sometimes fails with certain reference genomes because the genome.dict or exome.bed file is created incorrectly from the reference GTF file. The two files then disagree on which sequences exist (a sequence dictionary mismatch), which causes the step to fail.
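One way to confirm this kind of mismatch is to list the contigs that appear in the BED file but not in the sequence dictionary. The following is a minimal Python sketch, not part of the pipeline; the function names are hypothetical, and it assumes the dictionary is a standard SAM-style header with tab-separated `@SQ` lines carrying `SN:` tags.

```python
def dict_contigs(dict_text):
    """Collect contig names from the @SQ lines of a sequence dictionary."""
    names = set()
    for line in dict_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def missing_bed_contigs(bed_text, dict_text):
    """Return BED contigs absent from the dictionary, in first-seen order."""
    known = dict_contigs(dict_text)
    missing = []
    for line in bed_text.splitlines():
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig = line.split()[0]
        if contig not in known and contig not in missing:
            missing.append(contig)
    return missing


if __name__ == "__main__":
    # File names as used by the failing step; adjust paths as needed.
    with open("exome.bed") as bed, open("genome.dict") as seq_dict:
        for name in missing_bed_contigs(bed.read(), seq_dict.read()):
            print(name)
```

An empty result means every BED contig is present in the dictionary, so the failure would have a different cause.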

Command used and terminal output

No response

Relevant files

No response

System information

No response

Shaun-Regenbaum commented 10 months ago

I am going to do some more exploration of this, and hopefully submit a PR with a fix this week.

Shaun-Regenbaum commented 10 months ago

I have a working fork in which I think I have fixed the issue. In short, the issue arises when the exome.bed file contains non-standard or unplaced chromosomal sequences, which happens quite often in non-human genomes, for example:

chrUn_GJ060129v1 3730 4217
chrUn_GJ060129v1 5192 5333
chrUn_GJ060129v1 5806 6353
chrUn_GJ060163v1 0 311
chrUn_GJ060163v1 741 1129

Shaun-Regenbaum commented 10 months ago

My fix was to add a workflow step that filters the exome.bed file, keeping only intervals on chromosomes defined in the genome.dict file. It shouldn't affect other pipelines and should just allow the pipeline to handle a greater variety of reference genomes/species.
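The idea behind such a filter can be sketched in a few lines of Python. This is only an illustration of the approach described above, not the code from the fork; the function name `filter_bed_by_dict` is hypothetical, and it assumes the dictionary lists its contigs as `SN:` tags on tab-separated `@SQ` lines.

```python
def filter_bed_by_dict(bed_lines, dict_lines):
    """Keep only BED lines whose contig is declared in the sequence dictionary."""
    # Gather contig names from the SN: tag of each @SQ header line.
    contigs = set()
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t"):
                if field.startswith("SN:"):
                    contigs.add(field[3:])

    # Keep intervals on known contigs. Blank lines and any BED header lines
    # are dropped as well, because their first token is not a contig name.
    kept = []
    for line in bed_lines:
        fields = line.split()
        if fields and fields[0] in contigs:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Output file name is an assumption made for this example.
    with open("exome.bed") as bed, open("genome.dict") as seq_dict:
        filtered = filter_bed_by_dict(bed.readlines(), seq_dict.readlines())
    with open("exome.filtered.bed", "w") as out:
        out.writelines(filtered)
```

Lines are passed through unchanged, so the filtered file keeps the original column layout and ordering.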

maxulysse commented 10 months ago

I love this idea, that's an amazing addition

SAADAT-Abu commented 4 months ago

I am using your version with all default parameters and got this error:

Command error:
  Using GATK jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx30g -jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar BedToIntervalList --INPUT exome.bed --OUTPUT genome.bed.interval_list --SEQUENCE_DICTIONARY genome.dict --TMP_DIR .
  14:48:55.685 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
  [Wed Jul 17 14:48:55 GMT 2024] BedToIntervalList --INPUT exome.bed --SEQUENCE_DICTIONARY genome.dict --OUTPUT genome.bed.interval_list --TMP_DIR . --SORT true --UNIQUE false --DROP_MISSING_CONTIGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

I would be glad if you can help.