kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License

_pickle.UnpicklingError: invalid load key, 'A'. Failed to read from standard input: unknown file type #79

Closed wenyuhaokikika closed 9 months ago

wenyuhaokikika commented 9 months ago

Thanks for Dysgu ~~~

Many samples run normally, but one of them fails. What could be the reason? When I run dysgu run -x -p 10 /public/home/wenyuhao/seq/WGS/D1/resources/genome.fasta /public/home/wenyuhao/seq/WGS/dysgu/tmpDir /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016851-1.bam, it raises an exception:

2023-12-19 14:39:34,444 [INFO   ]  [dysgu-run] Version: 1.6.2
2023-12-19 14:39:34,444 [INFO   ]  run -x -p 10 /public/home/wenyuhao/seq/WGS/D1/resources/genome.fasta /public/home/wenyuhao/seq/WGS/dysgu/tmpDir /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016851-1.bam
2023-12-19 14:39:34,444 [INFO   ]  Destination: /public/home/wenyuhao/seq/WGS/dysgu/tmpDir
2023-12-19 14:43:59,686 [INFO   ]  dysgu fetch /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016851-1.bam written to /public/home/wenyuhao/seq/WGS/dysgu/tmpDir/DRR016851-1.dysgu_reads.bam, n=2995745, time=0:04:25 h:m:s
2023-12-19 14:43:59,686 [INFO   ]  Input file is: /public/home/wenyuhao/seq/WGS/dysgu/tmpDir/DRR016851-1.dysgu_reads.bam
2023-12-19 14:44:00,472 [INFO   ]  Sample name: DRR016851
2023-12-19 14:44:00,472 [INFO   ]  Writing vcf to stdout
2023-12-19 14:44:00,472 [INFO   ]  Running pipeline
2023-12-19 14:44:00,931 [INFO   ]  Calculating insert size. Removed 86 outliers with insert size >= 777.0
2023-12-19 14:44:00,942 [INFO   ]  Inferred read length 101.0, insert median 281, insert stdev 93
2023-12-19 14:44:00,943 [INFO   ]  Max clustering dist 746
2023-12-19 14:44:00,943 [INFO   ]  Building graph with clustering 746 bp
2023-12-19 14:44:27,722 [INFO   ]  Total input reads 2995745
2023-12-19 14:44:29,379 [INFO   ]  Graph constructed
2023-12-19 14:44:29,380 [INFO   ]  Minimum support 3
Traceback (most recent call last):
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/dysgu/main.py", line 259, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1188, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 996, in dysgu.cluster.pipe1
_pickle.UnpicklingError: invalid load key, 'A'.
Failed to read from standard input: unknown file type

When this problem occurred, I thought there was something wrong with my BAM file, but samtools view displays it normally.

samtools view /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016852-1.bam | head
DRR016852.155276582     163     1       10000   0       4S35M62S        =       10179   220     AACCATAACCCTAACCCTAACCCTAACCCTAACCCTAACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAA   :;?@+;<A&>?=;@>@>A;A=:@@?A><@@=?=?AA>A>>/7?>29A@<?>#.@@-=>><::>(<=7?52;<<;6=;D@,9A<:??-::-?<;<>.=A3=>     MC:Z:60S41M     MD:Z:35 PG:Z:MarkDuplicates     RG:Z:DRR016852  NM:i:0  AS:i:35 XS:i:35
DRR016852.15743215      99      1       10001   0       101M    =       10152   206     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC   BBDACCDAC@CCDAD@BBCAD@BBCAC@BBCAD@BCDAD@CBBAC>BBBAB@BA=@B@C@=?;?CCC@D?@EBAD:@AC9BACD?@C@AEEBECD@DC>AF     MC:Z:18M1D36M   MD:Z:101        PG:Z:MarkDuplicates     RG:Z:DRR016852  NM:i:0  AS:i:101        XS:i:101
DRR016852.51109508      163     1       10001   0       46M3D55M        =       10066   157     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA   AAC@BAC@C@BCD@C?AAC@D?BACAD@BBCAD@BBC@D@CBB@C@>B@B@C>B@CCD:B?CCD@AACDE@C?CBDBD@AC@<C@BDE<C@DDEB<ABCEA     MC:Z:16M1I76M8S MD:Z:46^CCT55   PG:Z:MarkDuplicates     RG:Z:DRR016852  NM:i:3  AS:i:92 XS:i:96
DRR016852.66863023      163     1       10001   0       101M    =       10178   281     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC   AAC@BAC@C@CCD@C?AAC@D?BACAD@BBCAD@BBCAD@BBDAD@BBDACACCCAD@BBDAC>DDB?C<ABA@C?CCCBDAEDEADBDDEBA@B>FCEAB     MC:Z:58M1I6M1I21M5D14M  MD:Z:101        PG:Z:MarkDuplicates     RG:Z:DRR016852  NM:i:0  AS:i:101        XS:i:101
DRR016852.73598611      99      1       10001   0       101M    =       10100   142     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC   BBDACCDAC@CCDAD@BBC@C?BBA?A@BBB?D@BC>AD;BBB?@=CCA?A;BA??<8?=-;79><B;)9<A@((:;AD7BAED?0?=:>?ADCDADAC83     MC:Z:43M        MD:Z:101        PG:Z:MarkDuplicates     RG:Z:DRR016852  NM:i:0  AS:i:101        XS:i:101
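
Note that samtools view piped into head only decodes the first few records, so it cannot tell whether the rest of the file is intact. As a rough complementary check, the sketch below tests whether a file ends with the 28-byte BGZF end-of-file marker described in the SAM/BAM specification (samtools quickcheck performs a similar test). The function name has_bgzf_eof is made up for illustration, this is not something dysgu does, and a passing check does not prove the file is fully valid.

```python
import os

# Empty BGZF block that terminates a complete BAM file (28 bytes, per the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file at `path` ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Jump to the last 28 bytes and compare them with the marker.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A file that fails this check was most likely truncated while being written or copied.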

kcleal commented 9 months ago

Hi @wenyuhaokikika, that is a strange error. It suggests that the multiprocessing job can't be loaded properly, which causes the pickle error. Have you tried re-running the sample? I have not seen this error before and am not sure how it could happen. Alternatively, running in single-thread mode should resolve it.
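
For context, "invalid load key" is what Python's pickle module reports when it is handed bytes that are not a valid pickle stream. The sketch below only illustrates that class of error with the standard library. It does not reproduce dysgu's internals, and the helper name try_unpickle is made up for this example.

```python
import pickle


def try_unpickle(data: bytes) -> str:
    """Return "ok" if `data` unpickles, else the UnpicklingError message."""
    try:
        pickle.loads(data)
    except pickle.UnpicklingError as exc:
        return str(exc)
    return "ok"


# A real pickle round-trips without error.
print(try_unpickle(pickle.dumps({"job": 1})))  # ok

# Bytes that merely look like sequence data start with 'A', which is not a pickle opcode.
print(try_unpickle(b"ACGT"))  # invalid load key, 'A'.

# A buffer of zero bytes fails the same way on its first byte.
print(try_unpickle(b"\x00\x00"))  # invalid load key, '\x00'.
```

The message changes with whatever byte happens to come first, which is consistent with the load key differing between runs when the data being loaded is incomplete or corrupted.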

wenyuhaokikika commented 9 months ago

Thank you, this problem has been solved~~~

When I ran it again, I got different exceptions in addition to the one above. For example:

2023-12-19 23:50:29,266 [INFO   ]  [dysgu-run] Version: 1.6.2
2023-12-19 23:50:29,266 [INFO   ]  run -x -p 6 /public/home/wenyuhao/seq/WGS/D1/resources/genome.fasta /public/home/wenyuhao/seq/WGS/dysgu/tmpDir /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016850-1.bam
2023-12-19 23:50:29,266 [INFO   ]  Destination: /public/home/wenyuhao/seq/WGS/dysgu/tmpDir
2023-12-19 23:53:10,416 [INFO   ]  dysgu fetch /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016850-1.bam written to /public/home/wenyuhao/seq/WGS/dysgu/tmpDir/DRR016850-1.dysgu_reads.bam, n=2989478, time=0:02:41 h:m:s
2023-12-19 23:53:10,416 [INFO   ]  Input file is: /public/home/wenyuhao/seq/WGS/dysgu/tmpDir/DRR016850-1.dysgu_reads.bam
2023-12-19 23:53:10,450 [INFO   ]  Sample name: DRR016850
2023-12-19 23:53:10,450 [INFO   ]  Writing vcf to stdout
2023-12-19 23:53:10,450 [INFO   ]  Running pipeline
2023-12-19 23:53:10,832 [INFO   ]  Calculating insert size. Removed 86 outliers with insert size >= 784
2023-12-19 23:53:10,843 [INFO   ]  Inferred read length 101.0, insert median 280, insert stdev 92
2023-12-19 23:53:10,844 [INFO   ]  Max clustering dist 740
2023-12-19 23:53:10,844 [INFO   ]  Building graph with clustering 740 bp
2023-12-19 23:53:37,925 [INFO   ]  Total input reads 2989478
2023-12-19 23:53:39,799 [INFO   ]  Graph constructed
2023-12-19 23:53:39,801 [INFO   ]  Minimum support 3
Traceback (most recent call last):
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/dysgu/main.py", line 259, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1188, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 996, in dysgu.cluster.pipe1
_pickle.UnpicklingError: unpickling stack underflow
Failed to read from standard input: unknown file type

or

2023-12-19 23:50:29,266 [INFO   ]  [dysgu-run] Version: 1.6.2
2023-12-19 23:50:29,266 [INFO   ]  run -x -p 6 /public/home/wenyuhao/seq/WGS/D1/resources/genome.fasta /public/home/wenyuhao/seq/WGS/dysgu/tmpDir /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016851-1.bam
2023-12-19 23:50:29,266 [INFO   ]  Destination: /public/home/wenyuhao/seq/WGS/dysgu/tmpDir
2023-12-19 23:53:12,329 [INFO   ]  dysgu fetch /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016851-1.bam written to /public/home/wenyuhao/seq/WGS/dysgu/tmpDir/DRR016851-1.dysgu_reads.bam, n=2995745, time=0:02:43 h:m:s
2023-12-19 23:53:12,329 [INFO   ]  Input file is: /public/home/wenyuhao/seq/WGS/dysgu/tmpDir/DRR016851-1.dysgu_reads.bam
2023-12-19 23:53:12,368 [INFO   ]  Sample name: DRR016851
2023-12-19 23:53:12,368 [INFO   ]  Writing vcf to stdout
2023-12-19 23:53:12,368 [INFO   ]  Running pipeline
2023-12-19 23:53:12,754 [INFO   ]  Calculating insert size. Removed 86 outliers with insert size >= 777.0
2023-12-19 23:53:12,765 [INFO   ]  Inferred read length 101.0, insert median 281, insert stdev 93
2023-12-19 23:53:12,766 [INFO   ]  Max clustering dist 746
2023-12-19 23:53:12,766 [INFO   ]  Building graph with clustering 746 bp
2023-12-19 23:53:39,518 [INFO   ]  Total input reads 2995745
2023-12-19 23:53:41,461 [INFO   ]  Graph constructed
2023-12-19 23:53:41,462 [INFO   ]  Minimum support 3
Traceback (most recent call last):
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/public/home/wenyuhao/anaconda3/envs/dysgu/lib/python3.8/site-packages/dysgu/main.py", line 259, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1188, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 996, in dysgu.cluster.pipe1
_pickle.UnpicklingError: invalid load key, '\x00'.
Failed to read from standard input: unknown file type

All of them are pickle errors. In the end I solved the problem by giving dysgu fewer CPU cores and requesting more memory and a higher --cpus-per-task from Slurm.

Slurm file:

#!/bin/bash
#SBATCH -J dysgu
#SBATCH --nodes=1
#SBATCH -n 1
#SBATCH --cpus-per-task=20
#SBATCH -p batch
#SBATCH -w comput5
#SBATCH --mem=200G
#SBATCH --export=ALL
#SBATCH -o log/output.log
#SBATCH -e log/error.log
#SBATCH --mail-type=FAIL # BEGIN,END,FAIL,ALL
#SBATCH --mail-user=925201392@qq.com
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/public/home/wenyuhao/anaconda3/lib/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/public/home/wenyuhao/anaconda3/pkgs/openssl-3.0.10-h7f8727e_2/lib/
parallel -j 3 <  run_dysgu.sh

and I set '--procs' to 10 for dysgu:

dysgu run -x -p 6 /public/home/wenyuhao/seq/WGS/D1/resources/genome.fasta /public/home/wenyuhao/seq/WGS/dysgu/tmpDir /public/home/wenyuhao/seq/WGS/D1/results/recal/DRR016851-1.bam | bcftools view -Oz -o /public/home/wenyuhao/seq/WGS/dysgu/DRR016851.dysgu.vcf.gz && tabix -p vcf /public/home/wenyuhao/seq/WGS/dysgu/DRR016851.dysgu.vcf.gz > /public/home/wenyuhao/seq/WGS/dysgu/logs/DRR016851.log
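
The numbers in the Slurm file can be sanity-checked with simple arithmetic. parallel -j 3 runs three dysgu commands at once, so, assuming each run uses roughly as many processes as its -p value, the product should stay within --cpus-per-task=20. The helper fits_allocation below is hypothetical, and this thread does not establish that CPU oversubscription caused the pickle errors. Treat it as a sketch of the check, not a diagnosis.

```python
def fits_allocation(parallel_jobs: int, procs_per_job: int, cpus_per_task: int) -> bool:
    """Return True if all concurrent jobs together fit within the CPU allocation."""
    return parallel_jobs * procs_per_job <= cpus_per_task


# Three concurrent dysgu runs with -p 6 need 18 CPUs, which fits in 20.
print(fits_allocation(3, 6, 20))   # True

# Three concurrent runs with -p 10 would need 30 CPUs, more than the 20 requested.
print(fits_allocation(3, 10, 20))  # False
```

Memory works the same way: the 200G requested with --mem is shared by all three concurrent runs.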

Thank you so much ~~~