kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License

Type error: '<' not supported between instances of 'bool' and 'str' #51

Closed distilledchild closed 11 months ago

distilledchild commented 1 year ago

Hi, first of all, thank you for the great tool, dysgu! I am running it on a bam file from 10x Genomics linked-read sequencing, and the error below keeps popping up. Could you give me some advice please?

2022-12-04 15:31:26,573 [INFO ] Sample name: SHR_OlaIpcv
2022-12-04 15:31:26,575 [INFO ] Writing vcf to stdout
2022-12-04 15:31:26,575 [INFO ] Running pipeline
2022-12-04 15:31:27,337 [INFO ] Calculating insert size. Removed 735 outliers with insert size >= 1033.0
2022-12-04 15:31:27,345 [INFO ] Inferred read length 148.0, insert median 302, insert stdev 128
2022-12-04 15:31:27,362 [INFO ] Max clustering dist 942
2022-12-04 15:31:27,362 [INFO ] Minimum support 3
2022-12-04 15:31:27,372 [INFO ] Building graph with clustering 942 bp
Traceback (most recent call last):
  File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'
Exception ignored in: 'dysgu.graph.process_alignment'
Traceback (most recent call last):
  File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'
Traceback (most recent call last):
  File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'
Exception ignored in: 'dysgu.graph.process_alignment'
Traceback (most recent call last):
  File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'
Traceback (most recent call last):
  File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'
Exception ignored in: 'dysgu.graph.process_alignment'
Traceback (most recent call last):
  File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'

kcleal commented 1 year ago

Hi @theshowmustgolangon, Thanks for reporting this. It looks like there is some problem interpreting the SA tag. What aligner did you use? If you are able to share a few example reads with SA tags, that would be very helpful for debugging

distilledchild commented 1 year ago

@kcleal Thank you for your support and help! The aligner I used is longranger (a linked-read-specific aligner), and here are a few lines.

A00735:93:HLHKJDSXX:1:1366:10312:23797 321 chr1 5 0 64M64H chr17 63605199 0 CAATCAAACACAGCATCCTTTTCAACAGAAGCAGAAGCTCATCTGAATATGCTCAAGGATGCTG FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF RX:Z:TCTGTCGCAGAAGCAC QX:Z:FFFFFFFFFFFFFFFF TR:Z:CGGACCA TQ:Z:FFFFFFF XS:i:-73 AS:i:-41 XM:A:0 AM:A:0 XT:i:0 SA:Z:chr17,63605286,-,54S74M,2,0; BX:Z:TCTGTCGCAGAAGCAC-1 RG:Z:SHR_OlaIpcv:LibraryNotSpecified:1:unknown_fc:0 OM:i:0
A00735:93:HLHKJDSXX:3:2517:6723:1579 353 chr1 5 0 64M64H chr11 3713885 0 CAATCAAACACAGCATCCTTTTCAACAGAAGCAGAAGCTCATCTGAATATGCTCAAGGATGCTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,F RX:Z:TCTGTCGCAGAAGCAC QX:Z:FFFFFFFFFFFFFFFF TR:Z:CGGACCA TQ:Z:FFFFFFF XS:i:-75 AS:i:-41 XM:A:0 AM:A:0 XT:i:0 SA:Z:chr11,3713873,+,54S74M,1,1; BX:Z:TCTGTCGCAGAAGCAC-1 RG:Z:SHR_OlaIpcv:LibraryNotSpecified:1:unknown_fc:0 OM:i:0
A00735:93:HLHKJDSXX:3:2162:1226:36839 385 chr1 5 0 64M85H chr10 90727447 0 CAATCAAACACAGCATCCTTTTCAACAGAAGCAGAAGCTCATCTGAATATGCTCAAGGATGCTG FFFFFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RX:Z:CCGTTACAGGTAGTCG QX:Z:F:FFFFFFFFFFFFFF XS:i:-83 AS:i:-54 XM:A:0 AM:A:0 XT:i:0 SA:Z:chr10,90727562,-,54S95M,0,0; BX:Z:CCGTTACAGGTAGTCG-1 RG:Z:SHR_OlaIpcv:LibraryNotSpecified:1:unknown_fc:0 OM:i:0
A00735:93:HLHKJDSXX:4:2644:6958:19930 321 chr1 5 0 65M63H chr16 32793676 0 CAATCAAACACAGCATCCTTTTCAACAGAAGCAGAAGCTCATCTGAATATGCTCAAGGATGCTGA FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, RX:Z:TCAGGTAGTCCACTCT QX:Z:FFFFFFFFFFFFFFFF TR:Z:CGGACCA TQ:Z:FFFFFFF XS:i:-77 AS:i:-40 XM:A:0 AM:A:0 XT:i:0 SA:Z:chr16,32793830,-,54S63M1D11M,13,2; BX:Z:TCAGGTAGTCCACTCT-1 RG:Z:SHR_OlaIpcv:LibraryNotSpecified:1:unknown_fc:0 OM:i:0
A00735:93:HLHKJDSXX:2:1435:2817:4554 417 chr1 5 0 65M86H chr8 51900376 0 CAATCAAACACAGCATCCTTTTCAACAGAAGCAGAAGCTCATCTGAATATGCTCAAGGATGCTGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RX:Z:CTGCGGACAGGGAAGG QX:Z:FFFFFFFFFFFFFFFF XS:i:-94 AS:i:-52 XM:A:0 AM:A:0 XT:i:0 SA:Z:chr8,51900223,+,74S77M,0,0; BX:Z:CTGCGGACAGGGAAGG-1 RG:Z:SHR_OlaIpcv:LibraryNotSpecified:1:unknown_fc:0 OM:i:0

kcleal commented 1 year ago

I think the SA tags look fine. I'm a bit confused by the error at the moment. Does this error pop up only a handful of times, or does it seem to happen for every read with an SA tag?
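
For reference, the SA tags in the records above follow the layout given in the SAM specification: one or more entries of the form rname,pos,strand,CIGAR,mapQ,NM, each terminated by a semicolon. The following is a minimal sketch of parsing such a tag in plain Python. It is not dysgu's parser; the names SuppAlignment and parse_sa_tag are made up for illustration.

```python
from typing import List, NamedTuple


class SuppAlignment(NamedTuple):
    """One entry of an SA:Z tag (hypothetical helper type)."""
    rname: str
    pos: int      # 1-based leftmost mapping position
    strand: str   # "+" or "-"
    cigar: str
    mapq: int
    nm: int       # edit distance reported by the aligner


def parse_sa_tag(value: str) -> List[SuppAlignment]:
    """Parse the value of an SA tag, without the leading 'SA:Z:' prefix."""
    records = []
    for entry in value.split(";"):
        if not entry:
            continue  # the tag ends with ';', so the last piece is empty
        fields = entry.split(",")
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields in SA entry, got {entry!r}")
        rname, pos, strand, cigar, mapq, nm = fields
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand in SA entry: {entry!r}")
        records.append(
            SuppAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return records


if __name__ == "__main__":
    # SA value copied from the first example read in this thread.
    for rec in parse_sa_tag("chr17,63605286,-,54S74M,2,0;"):
        print(rec)
```

All five example tags in this thread parse cleanly with this sketch, which is consistent with the tags themselves being well formed.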

distilledchild commented 1 year ago

@kcleal Not every read, but some reads, I think. It stops partway through the run because of a memory issue (out of memory). I am running it again now and will let you know.

distilledchild commented 1 year ago

@kcleal Hi, after giving it plenty of computational resources, I got this error. Could you take a look at it please?

Traceback (most recent call last):
  File "dysgu/graph.pyx", line 754, in dysgu.graph.alignments_from_sa_tag
TypeError: '<' not supported between instances of 'bool' and 'str'
[E::bgzf_read_block] Failed to read BGZF header at offset 61488496640
[E::bgzf_read] Read block operation failed with error 6 after 0 of 4 bytes
OSError: [Errno 5] Input/output error
Exception ignored in: 'pysam.libcalignmentfile.AlignmentFile.__dealloc__'
Traceback (most recent call last):
  File "/dysgu/lib/python3.10/site-packages/dysgu/main.py", line 280, in run_pipeline
    cluster.cluster_reads(ctx.obj)
OSError: [Errno 5] Input/output error
Traceback (most recent call last):
  File "/dysgu/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/dysgu/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/dysgu/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/dysgu/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/dysgu/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/dysgu/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/dysgu/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/dysgu/lib/python3.10/site-packages/dysgu/main.py", line 280, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1337, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 920, in dysgu.cluster.pipe1
  File "dysgu/graph.pyx", line 1249, in dysgu.graph.construct_graph
  File "dysgu/graph.pyx", line 1303, in dysgu.graph.construct_graph
  File "dysgu/coverage.pyx", line 464, in iter_genome
  File "dysgu/coverage.pyx", line 166, in _get_reads

kcleal commented 1 year ago

I think these errors are a bit weird:

[E::bgzf_read_block] Failed to read BGZF header at offset 61488496640
[E::bgzf_read] Read block operation failed with error 6 after 0 of 4 bytes

This indicates a problem reading the alignment file (using pysam). The bam file may be corrupted; that is what the error seems to indicate.
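
One quick way to look for this kind of corruption, independent of pysam, is to walk the BGZF block headers of the bam file and report the first offset where a header is missing or a block is cut short. The sketch below uses only the Python standard library and the block layout from the SAM/BGZF specification (gzip magic bytes, a "BC" extra subfield holding the block size minus one, and a fixed 28-byte end-of-file marker). The function name scan_bgzf is made up for illustration. It only checks the block framing and does not decompress anything, so it cannot catch every form of damage.

```python
import struct

# The 28-byte empty BGZF block that marks end-of-file (from the SAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)


def scan_bgzf(path):
    """Walk BGZF block headers in a file.

    Returns (n_blocks, bad_offset, has_eof_marker). bad_offset is None when
    every block header could be read and every block was complete.
    """
    n_blocks = 0
    last_block = b""
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            header = fh.read(12)
            if not header:
                break  # reached the end of the file on a block boundary
            # gzip magic, deflate method, FEXTRA flag set
            if len(header) < 12 or header[:4] != b"\x1f\x8b\x08\x04":
                return n_blocks, offset, False
            xlen = struct.unpack("<H", header[10:12])[0]
            extra = fh.read(xlen)
            if len(extra) < xlen:
                return n_blocks, offset, False
            # Look for the 'BC' subfield, which stores total block size - 1.
            bsize = None
            i = 0
            while i + 4 <= len(extra):
                slen = struct.unpack("<H", extra[i + 2:i + 4])[0]
                if extra[i:i + 2] == b"BC" and slen == 2 and i + 6 <= len(extra):
                    bsize = struct.unpack("<H", extra[i + 4:i + 6])[0]
                    break
                i += 4 + slen
            if bsize is None:
                return n_blocks, offset, False
            remaining = bsize + 1 - 12 - xlen
            if remaining < 0:
                return n_blocks, offset, False
            body = fh.read(remaining)
            if len(body) < remaining:
                return n_blocks, offset, False  # truncated block
            last_block = header + extra + body
            n_blocks += 1
    return n_blocks, None, last_block == BGZF_EOF
```

A result such as (n, None, True) means the block framing is intact and the end-of-file marker is present. A non-None offset can be compared with the offset in the bgzf_read_block message.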

distilledchild commented 1 year ago

@kcleal I will reinstall pysam and re-run it. Thank you!

kcleal commented 1 year ago

You can also validate your bam using picardtools https://gatk.broadinstitute.org/hc/en-us/articles/360036854731-ValidateSamFile-Picard-

distilledchild commented 1 year ago

@kcleal Thank you. I will try that one now!

distilledchild commented 1 year ago

You can also validate your bam using picardtools https://gatk.broadinstitute.org/hc/en-us/articles/360036854731-ValidateSamFile-Picard-

I got this error. Could it be the cause of the problem?

ValidateSamFile Value was put into PairInfoMap more than once. 0: A00735:93:HLHKJDSXX:4:2513:24478:4366
[Mon Dec 05 14:56:05 EST 2022] picard.sam.ValidateSamFile done. Elapsed time: 95.50 minutes.
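
As far as I understand it, Picard reports "Value was put into PairInfoMap more than once" when the same read name and mate number turns up more than once among the primary records. That reading is an assumption, not something confirmed in this thread. Under that assumption, the sketch below finds such reads in SAM text, for example the output of samtools view. The function name duplicated_primary_reads is made up for illustration.

```python
from collections import Counter

SECONDARY = 0x100
SUPPLEMENTARY = 0x800
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80


def duplicated_primary_reads(sam_lines):
    """Return (qname, mate) keys seen more than once as primary records.

    sam_lines is any iterable of tab-separated SAM text lines; header lines
    starting with '@' are skipped. mate is 1, 2, or 0 for unpaired reads.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only primary records are counted here
        if flag & FIRST_IN_PAIR:
            mate = 1
        elif flag & SECOND_IN_PAIR:
            mate = 2
        else:
            mate = 0
        counts[(qname, mate)] += 1
    return sorted(key for key, n in counts.items() if n > 1)
```

Note that the example reads posted earlier in this thread have flags 321, 353, 385 and 417, all of which include the secondary bit (0x100), so this sketch would skip them.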

kcleal commented 1 year ago

It's possible. Perhaps this biostars post will help: https://www.biostars.org/p/60263/

distilledchild commented 1 year ago

@kcleal Hi, I removed duplicates following this post, https://www.biostars.org/p/365882/, and ran it again. First, the error I originally reported has not gone away, although I feel I can ignore it. Second, I now get a bus error with a core dump, like this:

/var/spool/slurm/spool/job302969/slurm_script: line 24: 2867650 Bus error (core dumped) dysgu run .......

I read a few posts about bus errors and changed the number of cores to 1, so my command is now:

dysgu run -p1 --clean \
  --mode pe \
  --min-support 3 \
  --min-size 50 \
  --max-cov auto \
  --contigs False \
  --low-mem \
  --exclude ${BASE_DIR}/input/gap_region_rn7chr_ucsc.bed \
  -x \
  REF, output, input... > vcf

I will let you know how it works, and please tell me if you have any solutions for it.

Thank you for your support.

kcleal commented 1 year ago

The command looks fine. Possibly the low-mem option is causing an issue. Other than that, I am happy to try and debug it for you if you don't mind sending me a small region from your data (as long as I can reproduce the error). A bus error is a bit surprising; I am not really sure what could be causing it.

distilledchild commented 1 year ago

Would it be fine to get a bam file ?! If so, I can give a link to download from a cloud.. please let me know!

kcleal commented 1 year ago

Yes. I will also need to know which reference genome you used. If possible, could you send me a subset, e.g. chr21, rather than the whole bam?

distilledchild commented 1 year ago

@kcleal I extracted bams for chr12 and chr2, and I found that the chr12 bam causes a core dump. If it is OK to run dysgu chromosome by chromosome, I will try that for all of them. Please let me know your email so that I can send you a link!

distilledchild commented 1 year ago

@kcleal Here is the link for download. Could you check it please? I could not reproduce the core dump this time (probably because I increased the RAM), but you can still see the TypeError. Could you also check this warning from the log? UserWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.1.3. This might lead to breaking code or invalid results. Use at your own risk.

An additional question: can I use results from bams split by chromosome? I mean, as long as the SVs are not translocations, I think the other SV types should still be valid.

kcleal commented 1 year ago

Hi @theshowmustgolangon,

I've fixed the TypeError bug. It was caused by a read having a supplementary alignment identical to the primary alignment. It looks like this would occur rarely, so it would be unlikely to affect the output. The UserWarning can be safely ignored; it is left in there to flag potential issues in the future. You can build dysgu 1.3.14 from source if you want to run again, or you can wait a few days whilst I get a release onto PyPI.
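
To illustrate how an identical pair of alignments can produce this kind of message in Python, here is a hypothetical sketch. It is not dysgu's code, and the tuple layout is invented for the example. When tuples are sorted, Python only compares a later element if all earlier elements are equal. If the leading coordinates of two records tie and the trailing elements have different types (here a bool and a str), the comparison falls through to those elements and raises a TypeError.

```python
def order_naive(alignments):
    """Sort (query_start, query_end, payload) tuples by comparing whole tuples."""
    return sorted(alignments)


def order_safe(alignments):
    """Sort on the coordinates only, so the mixed-type payload is never compared."""
    return sorted(alignments, key=lambda aln: aln[:2])


# Hypothetical records: the payload is True for the primary alignment and a
# string for alignments taken from an SA tag.
primary = (0, 64, True)
supp_distinct = (54, 128, "chr17,63605286,-,54S74M,2,0")
supp_identical = (0, 64, "chr1,5,+,64M,0,0")

if __name__ == "__main__":
    # Coordinates differ, so the payloads are never compared.
    print(order_naive([primary, supp_distinct]))

    # Coordinates tie, so Python compares True with a str and fails.
    try:
        order_naive([primary, supp_identical])
    except TypeError as exc:
        print("TypeError:", exc)

    # Sorting on the coordinate prefix avoids the mixed-type comparison.
    print(order_safe([primary, supp_identical]))
```

This also fits the observation that only some reads trigger the error: the comparison only fails when the coordinates tie exactly.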

distilledchild commented 1 year ago

@kcleal I will compile the source code. Thank you for your support and help. I really appreciate it.

distilledchild commented 1 year ago

@kcleal Hi, I am trying to run the newer version, 1.3.14, but it fails with a core dump error. I gave it about 50 CPUs and 2T of memory. (I also ran the tool with 1 CPU, based on your comment in this issue board, but that was not successful either.) Could you give me some advice please?

kcleal commented 1 year ago

Can you post the full output log from running dysgu? I will see if I can help

distilledchild commented 1 year ago

This is an example of my log.
2022-12-13 14:02:16,095 [INFO ] [dysgu-run] Version: 1.3.14
2022-12-13 14:02:16,095 [INFO ] run -p22 --mode pe --min-support 3 --min-size 50 --merge-within True --drop-gaps True --max-cov auto --low-mem --contigs False --exclude /dysgu/input/gap_region_rn7chr_ucsc.bed -x /refs/rn7_ucsc/rn7chr.fa /dysgu/output/SHR_OlaIpcv /dysgu/input/SHR_OlaIpcv/SHR_OlaIpcv_phased_possorted_bam.nmsorted.fixmate.possorted.dedup.bam
2022-12-13 14:02:16,095 [INFO ] Destination: /dysgu/output/SHR_OlaIpcv
2022-12-13 14:02:16,097 [INFO ] Excluding /dysgu/input/gap_region_rn7chr_ucsc.bed from search
2022-12-13 14:02:16,194 [INFO ] Auto max-cov estimated 294x
2022-12-13 15:08:56,965 [INFO ] dysgu fetch /dysgu/input/SHR_OlaIpcv/SHR_OlaIpcv_phased_possorted_bam.nmsorted.fixmate.possorted.dedup.bam written to /dysgu/output/SHR_OlaIpcv/SHR_OlaIpcv_phased_possorted_bam.nmsorted.fixmate.possorted.dedup.dysgu_reads.bam, n=153333031, time=1:06:40 h:m:s
2022-12-13 15:08:57,003 [INFO ] Input file is: /dysgu/output/SHR_OlaIpcv/SHR_OlaIpcv_phased_possorted_bam.nmsorted.fixmate.possorted.dedup.dysgu_reads.bam
2022-12-13 15:08:57,158 [INFO ] Sample name: SHR_OlaIpcv
2022-12-13 15:08:57,159 [INFO ] Writing vcf to stdout
2022-12-13 15:08:57,159 [INFO ] Running pipeline
2022-12-13 15:08:57,799 [INFO ] Calculating insert size. Removed 799 outliers with insert size >= 1040.0
2022-12-13 15:08:57,811 [INFO ] Inferred read length 148.0, insert median 301, insert stdev 128
2022-12-13 15:08:57,813 [INFO ] Max clustering dist 941
2022-12-13 15:08:57,815 [INFO ] Minimum support 3
2022-12-13 15:08:57,821 [INFO ] Building graph with clustering 941 bp
2022-12-13 16:33:38,105 [INFO ] Total input reads 152447351
2022-12-13 16:34:59,418 [INFO ] Graph constructed
/var/spool/slurm/spool/job309515/slurm_script: line 43: 4192427 Bus error (core dumped) dysgu run -p${CPU} --mode pe --min-support 3 --min-size 50 --merge-within True --drop-gaps True --max-cov auto --low-mem --contigs False --exclude ${BASE_DIR}/input/gap_region_rn7chr_ucsc.bed -x ${REF}/rn7chr.fa ${BASE_DIR}/output/${SAMPLE} ${BASE_DIR}/input/${SAMPLE}/${SAMPLE}_phased_possorted_bam.nmsorted.fixmate.possorted.dedup.bam > ${BASE_DIR}/output/${SAMPLE}/${SAMPLE}_dedup_dysgu_sv.1.3.14.vcf

kcleal commented 1 year ago

Thanks. I've just been looking over the data you sent me. There appear to be a lot of reads with soft-clips, which might be the source of the high memory usage. I can recommend trying adapter trimming if you have not already done so. You might also be able to bypass the issue by increasing --clip-length to e.g. 50.

If you are able to send me the SHR_OlaIpcv_phased_possorted_bam.nmsorted.fixmate.possorted.dedup.dysgu_reads.bam file, I will be able to investigate further.
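
To get a rough feel for how many reads carry long soft-clips, the CIGAR strings can be inspected directly. The sketch below is plain Python with made-up function names (soft_clip_lengths, fraction_clipped). It assumes, for illustration only, that --clip-length acts as a minimum soft-clip length. It reports the fraction of CIGARs whose longest soft-clip reaches a chosen threshold.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips may sit outside the soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


def fraction_clipped(cigars, min_clip):
    """Fraction of CIGARs whose longest soft-clip is at least min_clip."""
    cigars = list(cigars)
    if not cigars:
        return 0.0
    hits = sum(1 for c in cigars if max(soft_clip_lengths(c)) >= min_clip)
    return hits / len(cigars)


if __name__ == "__main__":
    # CIGAR strings taken from the example reads and SA tags in this thread.
    example = ["64M64H", "54S74M", "54S95M", "54S63M1D11M", "74S77M"]
    for threshold in (15, 50, 60):
        print(threshold, fraction_clipped(example, threshold))
```

Comparing the fraction at a few thresholds gives an idea of how much a higher clip threshold would reduce the number of reads considered.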

distilledchild commented 1 year ago

@kcleal Thank you for your advice! I am uploading the file now, but it's pretty big, 74G. I am also running it with --clip-length set to 50. I will let you know once the upload and the run have finished.

kcleal commented 1 year ago

Thanks. Just to make sure you're uploading the right file: I don't need the original bam file, just the "dysgu_reads" file from the working directory, SHR_OlaIpcv_phased_possorted_bam.nmsorted.fixmate.possorted.dedup.dysgu_reads.bam

kcleal commented 11 months ago

I wasn't able to download the file. I'll close this for now, as the issue is probably related to either the number of soft-clipped reads or the fact that they are linked reads.