kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License
92 stars 11 forks source link

Error with --search option #64

Closed husamia closed 1 year ago

husamia commented 1 year ago

I am getting an error with the command --search that I can't figure out

this is 30x WGS data. I want to get SVs from only chromosome 17 so I provided the bed file and used the --search PATH .bed file, limit search to regions option

root@d0ce0b963ba0:/data# cat /data/Craniofacial/chr17_GAA_hg19_region.bed
chr17   1       159138663
root@d0ce0b963ba0:/data# dysgu run --search /data/Craniofacial/chr17_GAA_hg19_region.bed -o /data/Craniofacial/CARC012_dysgu_SV.vcf -p 20 --mode pe --clean Homo_sapiens_assembly19_1000genomes_decoy.fasta /data/Craniofacial/CARC_012_001_chr17 /data/Craniofa
cial/CARC_012_001.bam
2023-08-08 14:03:48,518 [INFO   ]  [dysgu-run] Version: 1.5.0
2023-08-08 14:03:48,522 [INFO   ]  run --search /data/Craniofacial/chr17_GAA_hg19_region.bed -o /data/Craniofacial/CARC012_dysgu_SV.vcf -p 20 --mode pe --clean Homo_sapiens_assembly19_1000genomes_decoy.fasta /data/Craniofacial/CARC_012_001_chr17 /data/Craniofacial/CARC_012_001.bam
2023-08-08 14:03:48,522 [INFO   ]  Destination: /data/Craniofacial/CARC_012_001_chr17
2023-08-08 14:03:48,524 [INFO   ]  Searching regions from /data/Craniofacial/chr17_GAA_hg19_region.bed
2023-08-08 14:04:34,736 [INFO   ]  dysgu fetch /data/Craniofacial/CARC_012_001.bam written to /data/Craniofacial/CARC_012_001_chr17/CARC_012_001.dysgu_reads.bam, n=218647, time=0:00:46 h:m:s
2023-08-08 14:04:34,754 [INFO   ]  Input file is: /data/Craniofacial/CARC_012_001_chr17/CARC_012_001.dysgu_reads.bam
2023-08-08 14:04:34,879 [INFO   ]  Sample name: CARC_012_001
2023-08-08 14:04:34,880 [INFO   ]  Writing SVs to /data/Craniofacial/CARC012_dysgu_SV.vcf
2023-08-08 14:04:34,884 [INFO   ]  Running pipeline
2023-08-08 14:04:36,568 [INFO   ]  Calculating insert size. Removed 0 outliers with insert size >= 717.0
2023-08-08 14:04:36,586 [INFO   ]  Inferred read length 150.0, insert median 325, insert stdev 75
2023-08-08 14:04:36,588 [INFO   ]  Max clustering dist 700
2023-08-08 14:04:36,588 [INFO   ]  Minimum support 3
2023-08-08 14:04:36,588 [INFO   ]  Building graph with clustering 700 bp
2023-08-08 14:04:41,525 [INFO   ]  Total input reads 218647
2023-08-08 14:04:41,705 [INFO   ]  Graph constructed
2023-08-08 14:04:50,490 [INFO   ]  Number of components 184739. N candidates 4011
Traceback (most recent call last):
  File "/opt/venv/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/dysgu/main.py", line 256, in run_pipeline
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1079, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 957, in dysgu.cluster.pipe1
  File "/opt/venv/lib/python3.9/site-packages/dysgu/re_map.py", line 365, in remap_soft_clips
    if ref_genome.get_reference_length(e.chrA) < 1000 or ref_genome.get_reference_length(e.chrB) < 1000:
  File "pysam/libcfaidx.pyx", line 347, in pysam.libcfaidx.FastaFile.get_reference_length
KeyError: 'chr17'
kcleal commented 1 year ago

Hi @husamia, does the reference genome use 17 instead of chr17? The bam file looks like it uses the chr17 convention, but not the reference? check with head Homo_sapiens_assembly19_1000genomes_decoy.fasta.fai

husamia commented 1 year ago

@kcleal I've tried replacing chr17 and just 17 in the bed file and it didn't help. I even used samtools view -b chr17 original.bam > chr17.bam and removed the --search and still getting the same error!

kcleal commented 1 year ago

I think the bam file is consistent with chr17 - the rest of the pipeline works ok. I was wondering about the reference genome. The error message is saying that chr17 is not in the fasta file. Tne functionref_genome.get_reference_length is throwing the key error. Could you confirm this?

husamia commented 1 year ago

@kcleal here is the check

head /mnt/d/Research/Homo_sapiens_assembly19_1000genomes_decoy.fasta.fai 1 249250621 52 100 101 2 243199373 251743232 100 101 3 198022430 497374651 100 101 4 191154276 697377358 100 101 5 180915260 890443229 100 101 6 171115067 1073167694 100 101 7 159138663 1245993964 100 101 8 146364022 1406724066 100 101 9 141213431 1554551781 100 101 10 135534747 1697177401 100 101

kcleal commented 1 year ago

Ah ok, thats the problem - the 'chr' is missing from the chromosome names. You can add this in using sed/awk, and then re-index the genome. It should read chr1 249250621 52 100 101. Alternatively, you should probably use the same reference genome that the sample was aligned to. You can normally check using samtools view -H and look at the command used during mapping along with the reference genome.

husamia commented 1 year ago

@kcleal the reference used is the generic hg19.fasta which I have two copies of downloaded from different sources and both are the same. furthermore, I can open the file in IGV and the chr17 file created with samtools in IGV as well

kcleal commented 1 year ago

I think IGV does automatic conversion between the two representations. This is not supported in dysgu unfortunately

husamia commented 1 year ago

@kcleal this is an issue.

the reference is generic named hg19.fasta which is not provided with the data. I used the reference with many other tools just fine. so can a feature added to do the conversion. Provide me me the awk/sed command to add the chr to the file?

kcleal commented 1 year ago

Its a problem I have encountered before with other genomics analysis, the different representations can be a pain. However doing automatic conversion can also cause problems. I recommend you download the hg19 with the chr representation from ucsc TableBrowser.

kcleal commented 1 year ago

Actually from here: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/

husamia commented 1 year ago

@kcleal I got it working after I downloaded the reference, uncompressed it, compressed it with bgzip and indexed it with samtools.