nf-core / rnavar

gatk4 RNA variant calling pipeline
https://nf-co.re/rnavar
MIT License

Failure on GATK4_BEDTOINTERVALLIST due to incorrect exome.bed generation from iGenomes References. #112

Open Shaun-Regenbaum opened 6 months ago

Shaun-Regenbaum commented 6 months ago

Description of the bug

The GATK4_BEDTOINTERVALLIST step sometimes fails with a variety of reference genomes because the genome.dict or exome.bed file is created incorrectly from the reference GTF files. This results in a sequence dictionary mismatch between the two, which causes the step to fail.
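
To check whether a given reference is affected, one option is to compare the contig names in the BED file against the `@SQ` entries of the sequence dictionary. The following is a minimal sketch, not part of the pipeline: the function names and the inline sample data are hypothetical, and it assumes the dictionary follows the usual SAM header layout, with tab-separated `@SQ` lines carrying an `SN:` field.

```python
def dict_contigs(dict_text):
    """Collect contig names from the SN: fields of @SQ lines."""
    names = set()
    for line in dict_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def bed_contigs(bed_text):
    """Collect contig names from the first column of a BED file."""
    names = set()
    for line in bed_text.splitlines():
        fields = line.split()
        # Skip blank lines and BED header/comment lines.
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        names.add(fields[0])
    return names


def missing_contigs(dict_text, bed_text):
    """Return BED contigs that the sequence dictionary does not define."""
    return sorted(bed_contigs(bed_text) - dict_contigs(dict_text))


# Hypothetical sample data, for illustration only.
sample_dict = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:2000\n"
sample_bed = "chr1\t10\t20\nchrUn_GJ060129v1\t3730\t4217\n"

missing = missing_contigs(sample_dict, sample_bed)
print(missing)
```

A non-empty result means the BED file references contigs absent from the dictionary, which is the kind of mismatch described above.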

Command used and terminal output

No response

Relevant files

No response

System information

No response

Shaun-Regenbaum commented 6 months ago

I am going to do some more exploration of this, and hopefully submit a PR with a fix this week.

Shaun-Regenbaum commented 6 months ago

I have a working fork in which I think I have fixed the issue. In short, the issue arises when the exome.bed file contains non-standard or unplaced chromosomal sequences, which can happen quite often in non-human genomes. For example:

chrUn_GJ060129v1 3730 4217
chrUn_GJ060129v1 5192 5333
chrUn_GJ060129v1 5806 6353
chrUn_GJ060163v1 0 311
chrUn_GJ060163v1 741 1129

Shaun-Regenbaum commented 6 months ago

My fix was to add a workflow step that simply filters the exome.bed file, keeping only the chromosomes defined in the genome.dict file. It shouldn't affect other pipelines and should just allow the pipeline to handle a greater variety of reference genomes/species.
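
The filtering idea can be sketched as follows. This is an illustration only and not the code from the fork: the function name `filter_bed_by_dict` and the sample data are hypothetical, and it assumes a standard sequence dictionary whose `@SQ` lines carry tab-separated `SN:` fields.

```python
def filter_bed_by_dict(bed_lines, dict_lines):
    """Keep only BED records whose contig is defined in the sequence dictionary."""
    allowed = set()
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    allowed.add(field[3:])

    kept = []
    for line in bed_lines:
        fields = line.split()
        # Blank lines, header lines, and records on undefined contigs are dropped.
        if fields and fields[0] in allowed:
            kept.append(line)
    return kept


# Hypothetical sample data, for illustration only.
dict_lines = ["@HD\tVN:1.6", "@SQ\tSN:chr1\tLN:1000"]
bed_lines = [
    "chr1\t100\t200",
    "chrUn_GJ060129v1\t3730\t4217",
    "chrUn_GJ060163v1\t0\t311",
]

filtered = filter_bed_by_dict(bed_lines, dict_lines)
print("\n".join(filtered))
```

In this sketch the unplaced `chrUn_*` intervals are dropped before the BED file would reach the interval list conversion, so the two inputs agree on contig names.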

maxulysse commented 6 months ago

I love this idea, that's an amazing addition