Hi @mc-er,
Thanks for reporting this. It is quite a confusing error. It appears that the temp file created by dysgu fetch might be mangled somehow. The error -4 while reading file can be traced back to the htslib function int bam_read1(BGZF *fp, bam1_t *b) from https://github.com/samtools/htslib/blob/develop/sam.c. It looks like -4 can be returned in a few situations, mostly when the bam record is mangled. I'm not sure if the problem is with the temp bam file, or something happening during the multiprocessing step of dysgu call.
Would you be able to run a quick test for me to try and work out where the problem is? Could you try running dysgu call --procs 1 using the temp file located at P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam? If this fails, then the problem is probably with the temp bam file. Thanks
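To be concrete, the test could be invoked along these lines (a sketch only: <reference.fa> and <working_dir> are placeholders for the reference genome and output directory you normally use, and the argument order should be double-checked against dysgu call --help):
# Sketch of the suggested single-process test; <reference.fa> and <working_dir> are placeholders
dysgu call --procs 1 <reference.fa> <working_dir> P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam > test.vcf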
Hi @kcleal, thanks for your quick response. I've done a run as you suggested (or at least I think I've run it that way), and it seems to run through fine. This is the output:
2022-07-21 16:53:57,331 [INFO ] [dysgu-fetch] Version: 1.3.11
2022-07-21 16:53:57,349 [INFO ] Searching regions from /home/mimmie/glob/dysgu/chr1_1.bed
2022-07-21 16:56:46,522 [INFO ] dysgu fetch dedup.ir.P9904-117.bam written to P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam, n=418562, time=0:02:49 h:m:s
2022-07-21 16:56:48,729 [INFO ] [dysgu-call] Version: 1.3.11
2022-07-21 16:56:48,729 [INFO ] Input file is: P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-21 16:56:48,730 [INFO ] call --procs 1 --overwrite parsed.pabies-2.0.fa P9904-117_temp P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-21 16:56:48,814 [INFO ] Sample name: P9904-117
2022-07-21 16:56:48,814 [INFO ] Writing vcf to stdout
2022-07-21 16:56:48,814 [INFO ] Running pipeline
2022-07-21 16:56:49,583 [INFO ] Calculating insert size. Removed 3000 outliers with insert size >= 1218
2022-07-21 16:56:49,593 [INFO ] Inferred read length 151.0, insert median 359, insert stdev 151
2022-07-21 16:56:49,595 [INFO ] Max clustering dist 1114
2022-07-21 16:56:49,595 [INFO ] Minimum support 3
2022-07-21 16:56:49,597 [INFO ] Building graph with clustering 1114 bp
2022-07-21 16:56:58,767 [INFO ] Total input reads 418562
2022-07-21 16:56:59,004 [INFO ] Graph constructed
2022-07-21 16:57:39,632 [INFO ] Number of components 258723. N candidates 38010
2022-07-21 16:57:42,691 [INFO ] Re-alignment of soft-clips done. N candidates 10878
2022-07-21 16:57:43,061 [INFO ] Number of candidate SVs merged: 1898
2022-07-21 16:57:43,061 [INFO ] Number of candidate SVs after merge: 8980
2022-07-21 16:57:43,063 [INFO ] Number of candidate SVs dropped with sv-len < min-size or support < min support: 4779
2022-07-21 16:57:43,117 [INFO ] Number or SVs near gaps dropped 0
2022-07-21 16:57:43,135 [INFO ] Loaded n=1 chromosome coverage arrays from P9904-117_temp
/home/mimmie/.pyenv/versions/3.9.9/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.1.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
2022-07-21 16:57:46,139 [INFO ] Model: pe, diploid: True, contig features: True. N features: 43
2022-07-21 16:57:47,932 [INFO ] dysgu call P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam complete, n=4201, time=0:00:59 h:m:s
I don't know if this matters, but one thing that might be worth mentioning is that when I installed dysgu the htslib git submodule was broken and resulted in an empty folder, so I manually added the latest version of htslib.
Thanks @mc-er, this is very useful. This brings up a few questions: does the OSError: error -4 while reading file appear every time you run that sample with -p4, or does it occur only sometimes? Additionally, in the second example you provided, did you run dysgu fetch with --procs 4, as in the first dysgu run example?
It seems 'git clone --recursive' does not work for newer htslib versions. I will get the readme updated to reflect that, thanks.
For your first question: for files where the OSError: error -4 while reading file error occurs, it occurs every time -p4 is used. For the second question: I didn't use -p4 when running dysgu fetch, but I can quickly try a run doing that.
It also ran through fine when running dysgu fetch -p4:
2022-07-22 08:19:30,755 [INFO ] [dysgu-fetch] Version: 1.3.11
2022-07-22 08:19:30,755 [INFO ] Searching regions from /home/mimmie/glob/dysgu/chr1_1.bed
2022-07-22 08:28:28,928 [INFO ] dysgu fetch dedup.ir.P9904-117.bam written to P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam, n=418562, time=0:08:58 h:m:s
2022-07-22 08:28:31,105 [INFO ] [dysgu-call] Version: 1.3.11
2022-07-22 08:28:31,105 [INFO ] Input file is: P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-22 08:28:31,105 [INFO ] call --procs 1 --overwrite parsed.pabies-2.0.fa P9904-117_temp P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-22 08:28:31,176 [INFO ] Sample name: P9904-117
2022-07-22 08:28:31,176 [INFO ] Writing vcf to stdout
2022-07-22 08:28:31,176 [INFO ] Running pipeline
2022-07-22 08:28:31,821 [INFO ] Calculating insert size. Removed 3000 outliers with insert size >= 1218
2022-07-22 08:28:31,828 [INFO ] Inferred read length 151.0, insert median 359, insert stdev 151
2022-07-22 08:28:31,829 [INFO ] Max clustering dist 1114
2022-07-22 08:28:31,829 [INFO ] Minimum support 3
2022-07-22 08:28:31,829 [INFO ] Building graph with clustering 1114 bp
2022-07-22 08:28:43,524 [INFO ] Total input reads 418562
2022-07-22 08:28:43,776 [INFO ] Graph constructed
2022-07-22 08:29:44,788 [INFO ] Number of components 258723. N candidates 38010
2022-07-22 08:29:48,568 [INFO ] Re-alignment of soft-clips done. N candidates 10878
2022-07-22 08:29:49,001 [INFO ] Number of candidate SVs merged: 1898
2022-07-22 08:29:49,001 [INFO ] Number of candidate SVs after merge: 8980
2022-07-22 08:29:49,004 [INFO ] Number of candidate SVs dropped with sv-len < min-size or support < min support: 4779
2022-07-22 08:29:49,046 [INFO ] Number or SVs near gaps dropped 0
2022-07-22 08:29:49,062 [INFO ] Loaded n=1 chromosome coverage arrays from P9904-117_temp
/home/mimmie/.pyenv/versions/3.9.9/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.1.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
2022-07-22 08:29:52,989 [INFO ] Model: pe, diploid: True, contig features: True. N features: 43
2022-07-22 08:29:55,037 [INFO ] dysgu call P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam complete, n=4201, time=0:01:23 h:m:s
Thanks for running these tests. This might be a problem with dysgu, not htslib; I'm still a bit confused about the possible source at the moment. As the file does not look too big, would you mind sending me the dysgu_reads.bam file generated by dysgu fetch (via email, clealk@cardiff.ac.uk)? This will make it much easier to find the issue, and I can also check whether the error is reproducible (I will delete the file as soon as the issue is fixed). Is this a custom reference genome you are using, also? Thanks for your input!
I've done another test run out of curiosity: I ran dysgu fetch --procs 4 followed by dysgu call --procs 4 using the dysgu_reads.bam file, which ran through fine. Then I ran dysgu run --procs 4 and the error appeared again.
#########################
# running dysgu fetch --procs 4
2022-07-22 11:27:35,847 [INFO ] [dysgu-fetch] Version: 1.3.11
2022-07-22 11:27:35,848 [INFO ] Searching regions from /home/mimmie/glob/dysgu/chr1_1.bed
2022-07-22 11:32:17,281 [INFO ] dysgu fetch dedup.ir.P9904-117.bam written to P9904-117_temp1/dedup.ir.P9904-117.dysgu_reads.bam, n=418562, time=0:04:41 h:m:s
#########################
# running dysgu call --procs 4
2022-07-22 11:32:19,362 [INFO ] [dysgu-call] Version: 1.3.11
2022-07-22 11:32:19,362 [INFO ] Input file is: P9904-117_temp1/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-22 11:32:19,362 [INFO ] call --procs 4 --overwrite parsed.pabies-2.0.fa P9904-117_temp1 P9904-117_temp1/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-22 11:32:19,479 [INFO ] Sample name: P9904-117
2022-07-22 11:32:19,479 [INFO ] Writing vcf to stdout
2022-07-22 11:32:19,479 [INFO ] Running pipeline
2022-07-22 11:32:19,983 [INFO ] Calculating insert size. Removed 3000 outliers with insert size >= 1218
2022-07-22 11:32:19,989 [INFO ] Inferred read length 151.0, insert median 359, insert stdev 151
2022-07-22 11:32:19,990 [INFO ] Max clustering dist 1114
2022-07-22 11:32:19,991 [INFO ] Minimum support 3
2022-07-22 11:32:19,991 [INFO ] Building graph with clustering 1114 bp
2022-07-22 11:32:31,183 [INFO ] Total input reads 418562
2022-07-22 11:32:31,418 [INFO ] Graph constructed
2022-07-22 11:32:48,114 [INFO ] Number of components 258723. N candidates 37990
2022-07-22 11:32:51,954 [INFO ] Re-alignment of soft-clips done. N candidates 10860
2022-07-22 11:32:52,383 [INFO ] Number of candidate SVs merged: 1901
2022-07-22 11:32:52,384 [INFO ] Number of candidate SVs after merge: 8959
2022-07-22 11:32:52,386 [INFO ] Number of candidate SVs dropped with sv-len < min-size or support < min support: 4753
2022-07-22 11:32:52,425 [INFO ] Number or SVs near gaps dropped 1
2022-07-22 11:32:52,439 [INFO ] Loaded n=1 chromosome coverage arrays from P9904-117_temp1
/home/mimmie/.pyenv/versions/3.9.9/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.1.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
2022-07-22 11:32:56,385 [INFO ] Model: pe, diploid: True, contig features: True. N features: 43
2022-07-22 11:32:58,515 [INFO ] dysgu call P9904-117_temp1/dedup.ir.P9904-117.dysgu_reads.bam complete, n=4205, time=0:00:39 h:m:s
#########################
# running dysgu run --procs 4
2022-07-22 11:33:00,461 [INFO ] [dysgu-run] Version: 1.3.11
2022-07-22 11:33:00,461 [INFO ] run --procs 4 -c --regions /home/mimmie/glob/dysgu/chr1_1.bed --regions-only True parsed.pabies-2.0.fa P9904-117_temp2 dedup.ir.P9904-117.bam
2022-07-22 11:33:00,461 [INFO ] Destination: P9904-117_temp2
2022-07-22 11:33:00,461 [INFO ] Searching regions from /home/mimmie/glob/dysgu/chr1_1.bed
2022-07-22 11:33:09,437 [INFO ] dysgu fetch dedup.ir.P9904-117.bam written to P9904-117_temp2/dedup.ir.P9904-117.dysgu_reads.bam, n=417399, time=0:00:08 h:m:s
2022-07-22 11:33:09,499 [INFO ] Input file is: P9904-117_temp2/dedup.ir.P9904-117.dysgu_reads.bam
2022-07-22 11:33:09,500 [INFO ] Input file has no index, but --include was provided, attempting to index
2022-07-22 11:33:10,074 [INFO ] Sample name: P9904-117
2022-07-22 11:33:10,074 [INFO ] Writing vcf to stdout
2022-07-22 11:33:10,074 [INFO ] Running pipeline
2022-07-22 11:33:10,802 [INFO ] Calculating insert size. Removed 193 outliers with insert size >= 851.0
2022-07-22 11:33:10,818 [INFO ] Inferred read length 151.0, insert median 387, insert stdev 87
2022-07-22 11:33:10,819 [INFO ] Max clustering dist 822
2022-07-22 11:33:10,819 [INFO ] Minimum support 3
2022-07-22 11:33:10,819 [INFO ] Building graph with clustering 822 bp
2022-07-22 11:33:30,749 [INFO ] Total input reads 533293
2022-07-22 11:42:06,693 [INFO ] Graph constructed
Process Process-1:
Traceback (most recent call last):
File "/home/mimmie/.pyenv/versions/3.9.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/mimmie/.pyenv/versions/3.9.9/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "dysgu/cluster.pyx", line 813, in dysgu.cluster.process_job
File "dysgu/cluster.pyx", line 761, in dysgu.cluster.component_job
File "dysgu/call_component.pyx", line 1972, in dysgu.call_component.call_from_block_model
File "dysgu/call_component.pyx", line 1986, in dysgu.call_component.call_from_block_model
File "dysgu/call_component.pyx", line 1933, in dysgu.call_component.multi
File "dysgu/call_component.pyx", line 1822, in dysgu.call_component.get_reads
File "pysam/libcalignmentfile.pyx", line 1876, in pysam.libcalignmentfile.AlignmentFile.__next__
OSError: error -4 while reading file
Of course I can send you the file. Happy to help sort out the error! Given the new information, would you want one that was created during dysgu run?
Great, that is quite revealing. It looks like the error is coming from the --regions / --regions-only True options. I think I have enough to try and find the error now. In the meantime, I can recommend running dysgu as a two-step pipeline (avoiding dysgu run), as you have above: so use dysgu fetch -p4 --search ... followed by dysgu call .... This should give equivalent results to the run pipeline anyway.
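For example, a two-step invocation could look roughly like this (paths are taken from your logs above; whether --search takes the bed file directly, and the exact positional argument order for fetch, should be confirmed with dysgu fetch --help):
# Rough sketch of the two-step pipeline; confirm exact fetch usage with dysgu fetch --help
dysgu fetch -p4 --search /home/mimmie/glob/dysgu/chr1_1.bed P9904-117_temp dedup.ir.P9904-117.bam
dysgu call -p4 --overwrite parsed.pabies-2.0.fa P9904-117_temp P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam > P9904-117.vcf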
A side note: I started a run as you suggested and got this error: Exception: --regions-only is not currently supported via the 'call' command
I can see some issues with how the --regions-only True option is treated - I think these have arisen with more recent versions of dysgu, my apologies. Since you have run dysgu fetch --search, which hard-filters the genome to the target regions, --regions-only will be redundant in this case and will not be needed for the subsequent dysgu call command.
Further to my last message, for the subsequent dysgu call the --ibam option should also be filled out; this will flag that the current input bam file was derived from dysgu fetch, as well as providing a better sample for insert size calculation.
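In practice that could look something like this (illustrative only, with paths taken from the logs above):
# Illustrative: pass the original alignment file via --ibam when calling on the dysgu fetch output
dysgu call -p4 --ibam dedup.ir.P9904-117.bam --overwrite parsed.pabies-2.0.fa P9904-117_temp P9904-117_temp/dedup.ir.P9904-117.dysgu_reads.bam > P9904-117.vcf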
I believe the error has now been resolved. You will have to use git clone and build dysgu from source whilst I get a release uploaded to pypi.
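For anyone following along, a source install is roughly the following (a sketch; the README has the authoritative steps, and note the htslib submodule caveat discussed earlier in this thread):
# Rough sketch of a source install; see the dysgu README for the authoritative steps
git clone --recursive https://github.com/kcleal/dysgu.git
cd dysgu
# if the bundled htslib folder comes through empty, add a recent htslib manually (as noted above)
pip install .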
There was a problem with the pipeline when using --regions-only True. This option is now only supported when using dysgu call, and it now behaves as expected, i.e. when --regions-only True is set, the input regions are scanned and mate-pair information is used to try and add additional regions to search. I have added a warning in the readme that this can lead to high memory usage if the input regions are large. Hope this resolves things for you.
Fantastic! Thank you very much for all the help!
I will close this for now, but please re-open if any issues arise
I'm trying to run dysgu with the multiprocessing flag (-p); some samples run smoothly, but once in a while I get errors like this one:
If I run the sample without multiprocessing, it runs through without any problems. Since I have over 1000 samples with massive genomes, it's not an option to run without multiprocessing, so I'm grateful for any ideas and suggestions as to what could cause this error and how to fix it.