Enabling region of interest (-x <.bed>)

nh13 / DWGSIM

Whole Genome Simulator for Next-Generation Sequencing

GNU General Public License v2.0

92 stars 36 forks source link

Enabling region of interest (-x <.bed>) #56

Closed quacksawbones closed 4 years ago

quacksawbones commented 4 years ago

Is there a timeline for the implementation of the -x "region of interest bed" functionality? I saw that it has a note comment in the manual "[not used]". It seems to generate the reads but then has difficulty outputting the fastq files.

Thanks for this awesome tool.

nh13 commented 4 years ago

It should work, and if you don't specify the option, the help message should say "not used". Did you try it?

quacksawbones commented 4 years ago

I wasn't sure whether "not used" meant that it was not functioning yet, as opposed to "null".

I have tried using this switch, but unfortunately, when I use a .bed file to restrict the areas of the genome to generate reads from, it looks like the tool generates them internally. I can see the counter running (Currently on ), it gets to the end of whatever number of reads are to be generated (-N ), and it lists the contigs from the fasta reference that are not used (#0 skip sequence '' as it is not in the targeted region), but the tool does not output the reads. It hangs at the very last stage.

I can confirm that the fasta reference and the .bed file use the same numbering system (hg19, no chr prefix), and that the tool works both when I don't use a .bed file with the -x switch and when I use a fasta reference generated by bedtools getfasta -fi -bed .

Thanks.

quacksawbones commented 4 years ago

Have done some further investigations on this issue.

It appears like the "upper limit" for the number of read pairs that can be generated while the -x switch is being used is -N 7671 ( -N 7672 or above causes the output to hang).

However, looking at the build logs (https://hackmd.io/@OifJ0LTSSlaix6Pb0UEfpA/B1ect9AckI/edit - please switch to "Edit Mode"), it looks like I am receiving some errors when it comes to zlib (line 129 and beyond). Could this be related to compatibility issues with my cluster, the version of zlib and DWGSIM?

gcc x86_64 version: 2.17-292.el7 and 4.8.5-39.el7 (different servers)

zlib x86_64 version: 1.2.7-18.el7

Thanks

quacksawbones commented 4 years ago

Found a workaorund! I am able to correctly generate more that 7671 read pairs (and the number of reads I require), but ONLY if I use the -C switch to specify the coverage across the region specified by the .bed file. I can't get it to generate more than 7671 read pairs when the region is provided in a bed file when I use the -N switch.

Thanks

nh13 commented 4 years ago

@quacksawbones if you post your inputs, I could try to reproduce. I am closing as you found a workaround.