ryanlayer / samplot

Plot structural variant signals from many BAMs and CRAMs
MIT License
504 stars 66 forks source link

samplot vcf, only INV, DUP, DEL #188

Closed warthmann closed 7 months ago

warthmann commented 8 months ago

hello, thank you for this potentially useful tool. I have it running on delly output (bcf file) that had been filtered with the delly somatic filter. The file contains several SVTYPEs, namely DEL, DUP, INV, BND, INS, however, samplot vcf will only plot INV, DUP, DEL, but not BND and INS. I also cannot filter for those. I am now wondering whether this is a bug or a feature. I was able to visualise BNDs when using samplot plot specifying 2 locations (-c, -s, -e ) thank you very much best Norman

pontushojer commented 8 months ago

ATM samplot can only be used for visualising variants with SVTYPE equal to DEL, DUP, INV, BND and TRA.

Your BND events are interpreted as interchromosomal translocations (e.g. TRA). You would aslo need info about the second chromosome which is searched for at the INFO/CHR2x field. Unless your BND events are annotated with this field they will be skipped by samplot.

INS cannot be visualised ATM, for related discussion see https://github.com/ryanlayer/samplot/issues/129

warthmann commented 8 months ago

Thank you for this speedy response. My bcf file does have a INFO/CHR2 field. See an example entry below.

chr27 20730888 BND00007715 G G[chr26:5787808[ 330 PASS IMPRECISE;SVTYPE=BND;SVMETHOD=EMBL.DELLYv1.1.8;END=20730889;CHR2=chr26;POS2=5787808;PE=10;MAPQ=33;CT=3to5;CIPOS=-890,890;CIEND=-890,890;RDRATIO=1;SOMATIC GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 0/1:-26.3774,0,-64.1795:10000:PASS:11076:16635:5559:2:12:10:0:0 0/0:0,-1.50515,-30:15:PASS:6022:9622:3600:2:5:0:0:0

pontushojer commented 8 months ago

Yes it has the INFO/CHR2 field but not the INFO/CHR2x field. It seems from the relevant code that INFO/CHR2 has been allowed at some point

                # TODO add specific exception
                translocation_chrom = variant.info.get("CHR2x")
                # translocation_chrom = variant.info.get("CHR2")

We should probably make samplot a bit more flexible here to accept both info fields as different tools reports this in different ways.

warthmann commented 8 months ago

Dear @pontushojer, thank you for your time and identifying the issue so quickly. I'd be thankful for a fix, even temporary. Simply changing INFO/CHR2 to INFO/CHR2x in the vcf is not doing it.

pontushojer commented 8 months ago

Try this and let me know if it works for you

pip install git+https://github.com/pontushojer/samplot.git@fix-188
warthmann commented 8 months ago

install went well, running not so. (Note that I had previously used samplot-1.3.0) ... Installing collected packages: wget, samplot Attempting uninstall: samplot Found existing installation: samplot 1.3.0 Uninstalling samplot-1.3.0: Successfully uninstalled samplot-1.3.0 Successfully installed samplot-1.3.1 wget-3.2

$ samplot vcf --vcf S30GY_01vsControl.somatic-prefiltered.bcf -d BNDs/ -O png -b S30GY_01.bam SA_GN.bam > samplot_commands.sh Traceback (most recent call last): File "/home/norman/miniconda3/envs/samplot/bin/samplot", line 5, in from samplot.main import main File "/home/norman/miniconda3/envs/samplot/lib/python3.10/site-packages/samplot/main.py", line 8, in from .samplot_vcf import add_vcf File "/home/norman/miniconda3/envs/samplot/lib/python3.10/site-packages/samplot/samplot_vcf.py", line 935 except KeyError, ValueError as e: ^^^^^^^^^^^^^^^^^^^^^^^^^ SyntaxError: multiple exception types must be parenthesized

pontushojer commented 8 months ago

Sorry @warthmann, the syntax error should now be fixed. Please try and reinstall samplot now and see if it works.

warthmann commented 8 months ago

Thank you. Please explain "reinstall samplot". I've tried several variations, but with no success. Also, am I expected to use my original vcf with "CHR2" or my modified vcf file with "CHR2x"?

pontushojer commented 7 months ago

Thank you. Please explain "reinstall samplot". I've tried several variations, but with no success. Also, am I expected to use my original vcf with "CHR2" or my modified vcf file with "CHR2x"?

Certainly. First uninstall the current samplot

pip uninstall samplot

Then reinstall samplot from the branch with the fix.

pip install git+https://github.com/pontushojer/samplot.git@fix-188

You should use the original VCF with the CHR2 tag.

If you run into further issue it would be were great if you could provide some testdata. E.g. a small VCF containing the translocation and a BAM that containing the regions surrounding the variant.

warthmann commented 7 months ago

ok, before I prepare/provide files, maybe the error message is enough for now. It is the same that I had gotten when manually modifying my vcf from CHR2 to CHR2x. There seems some parsing error.

(samplot) samplot$ samplot vcf --filter "SVTYPE == 'BND'" --vcf somatic-prefiltered.bcf -d BNDs/ -O png -b tumor.bam control.bam > samplot_commands.sh usage: samplot plot [-h] [-n TITLES [TITLES ...]] [-r REFERENCE] [-z Z] -b BAMS [BAMS ...] [-o OUTPUT_FILE] [--output_dir OUTPUT_DIR] -s START -e END -c CHROM [-w WINDOW] [-d MAX_DEPTH] [-t SV_TYPE] [-T TRANSCRIPT_FILE] [--transcript_filename TRANSCRIPT_FILENAME] [--max_coverage_points MAX_COVERAGE_POINTS] [-A ANNOTATION_FILES [ANNOTATION_FILES ...]] [--annotation_filenames ANNOTATION_FILENAMES [ANNOTATION_FILENAMES ...]] [--coverage_tracktype {stack,superimpose,none}] [-a] [-H PLOT_HEIGHT] [-W PLOT_WIDTH] [-q INCLUDE_MQUAL] [--separate_mqual SEPARATE_MQUAL] [-j] [--start_ci START_CI] [--end_ci END_CI] [--long_read LONG_READ] [--ignore_hp] [--min_event_size MIN_EVENT_SIZE] [--xaxis_label_fontsize XAXIS_LABEL_FONTSIZE] [--yaxis_label_fontsize YAXIS_LABEL_FONTSIZE] [--legend_fontsize LEGEND_FONTSIZE] [--annotation_fontsize ANNOTATION_FONTSIZE] [--hide_annotation_labels] [--coverage_only] [--max_coverage MAX_COVERAGE] [--same_yaxis_scales] [--marker_size MARKER_SIZE] [--jitter [JITTER]] [--dpi DPI] [--annotation_scalar ANNOTATION_SCALAR] [--zoom ZOOM] [--debug DEBUG] samplot plot: error: argument -s/--start: invalid int value: '18175971-e'

pontushojer commented 7 months ago

Great that you posted the log, this actually seems to be a separate error now!

I found a bug in the code where there is a missing space in the command generated for translocations, see https://github.com/ryanlayer/samplot/pull/191/commits/b3a2f7ae79a90513996584a5d4906afe6d9aea34.

Could you please try reinstalling and running again ;)

warthmann commented 7 months ago

Dear @pontushojer , I can confirm that this is running now, on the original .bcf file with "CHR2". Thank you very much for your efforts. I am hoping that the output is correct, though?!

pontushojer commented 7 months ago

@warthmann great to hear that it works now!

warthmann commented 7 months ago

Yes, it does run, but I believe it is not plotting the expected. It seems samplot vcf is not plotting the second locus correctly, but rather continues on locus 1. See attached pictures of one example BND, particularly the (intended) tick axis labels on the x. One created with samplot plot (top), the other with samplot vcf (bottom). (disregard the erroneous tick labels, they seem to be due to a too recent matplotlib, see issue #189 ) samplot-plot-chr29-6 samplot-vcf-BND_chr29_17392082_chr06_17392084

pontushojer commented 7 months ago

That does indeed look strange.

If you run samplot vcf with the option --manual_run it will output the samplot plot commands used to generate the figures to a file called "samplot_vcf_cmds.tmp" (or whatever you specify with the option --command_file). If you run this, you can compare the command generated by samplot vcf to the samplot plot command you used above to generate the "correct" plot.

Please also post all commands that would make it easier to troubleshoot.

warthmann commented 7 months ago

thanks! yes, this helped. There is a parsing/pasting error somewhere, such that not the position of CHR2 is used (POS2) but rather a continuation of positions of the first. See below.

This is the line from the vcf chr29 17392083 BND00008582 A A[chr06:22567816[ 360 PASS IMPRECISE;SVTYPE=BND;SVMETHOD=EMBL.DELLYv1.1.8;END=17392084;CHR2=chr06;POS2=22567816;PE=6;MAPQ=60;CT=3to5;CIPOS=-890,890;CIEND=-890,890;RDRATIO=1;SOMATIC GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 0/1:-28.7753,0,-100.775:10000:PASS:26074:47325:21251:2:18:6:0:0 0/0:0,-3.31133,-66:33:PASS:2808:18152:15344:2:11:0:0:0

and this is the command, extracted with _samplot vcf --manualrun. Note that the positions in the last line are incorrect!

samplot plot \ -z 3 \ -n S30GY_01 control-sample:SA_GN \ --start_ci '890,890' --end_ci '890,890' -t BND \ -c chr29 -s 17392082 -e 17392083 \ -o BNDs/BND_chr29_17392082_chr06_17392084.png \ -d 1 \ -b S30GY_01.bam SA_GN.bam \ -c chr06 -s 17392084 -e 17392085

pontushojer commented 7 months ago

This is very useful! If seems that samplot takes the 2nd breakpoint position from the INFO/END tag which is 17392084. Therefor the samplot plot command is wrong here. After some searching it seems like this tag should not be used to encode this information. Delly uses INFO/POS2 (22567816 here) for this so I updated the code to work with this.

It would be very good to include this variant as a test case. If you have the time it would be great if you could upload a VCF containing this variant along with a BAM with the reads surrounding the breakpoints.

warthmann commented 7 months ago

Seems to be working, thanks! I hope that those changes will not have consequences for plotting of any of the other types. Strange that we should be the first to notice a bug of this magnitude.

warthmann commented 7 months ago

Hello again @pontushojer, I am using the next tool, dysgu (https://github.com/kcleal/dysgu), which seems pretty good. Ie, it did not skip a known translocation as delly did, however, when using samplot on this I am experiencing 2 issues: 1) samplot vcf is only plotting the tumor sample and not the other 2) the same translocation/position problem as above

Dysgu labels translocations with 'TRA' and the position of the other end of a translocation is encoded in INFO/CHR2_POS, as opposed to INFO/POS2. Again, samplot seems to use INFO/END. Would you mind including this case, too? And any idea why it is not plotting both of them? I am providing both bams as per usual.

my command samplot vcf \ --vcf S30GY_01-vs-SA_GN.vcf \ -d all-plots/ -O png \ -b S30GY_01.bam \ SA_GN.bam \ > samplot_commands.sh

From the VCF header

INFO=

INFO=

INFO=

Example TRA from vcf chr29 2368050 14371 C . lowProb SVMETHOD=DYSGUv1.6.2;SVTYPE=TRA;END=2368051;CHR2=chr32;CHR2_POS=20919096;GRP=14371;NGRP=1;CT=5to5;CIPOS95=0;CIEN>

the --manual run output for this TRA samplot plot -z 3 -n S30GY_01 -t TRA -c chr29 -s 2368049 -e 2368050 -o all-plots/TRA_chr29_2368049_chr32_2368051.png -d 1 -b S30GY_01.bam -c chr32 -s 2368051 -e 2368052

thanks a lot for your help!

pontushojer commented 7 months ago

I added CHR2_POS as an option to get the 2nd breakpoint to the PR.

For the first issue I need more info. Please rerun samplot vcf with the option --debug and paste the output. Also could you provide a VCF with the variant in question, e.g.

bcftools view S30GY_01-vs-SA_GN.vcf chr29:2368050-2368051 > example.vcf
pontushojer commented 7 months ago

Great!

warthmann commented 7 months ago

sorry. I replied too early. The translocaton regions seem ok now, but it is still plotting only the sample (but not the control) and that is irrespective of SVTYPE. I am attaching the requested, although the debug output seems uninformative. I am also attaching a --manual_run output. samplot-debug-output.txt example.vcf.gz samplot_vcf_cmds.tmp.txt

pontushojer commented 7 months ago

I think I figured out the reason. In your VCF only the sample "S30GY_01" is present.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S30GY_01 

As the control "SA_GN" (?) is missing from the VCF it will not be used to generate the samplot plot commands. I added a new PR to warn about this https://github.com/ryanlayer/samplot/pull/193.

warthmann commented 7 months ago

ahh! thank you for connecting these dots. yes, true, somatic filtering in dysgu returns a vcf with only one sample. Weird that the 2nd sample present in the vcf should be necessary. Is the FORMAT information per sample even used by samplot? I tried to fix it by supplying samplot with "--sample_ids S30GY_01 SA_GN", but same issue. Not sure what to do.

pontushojer commented 7 months ago

ahh! thank you for connecting these dots. yes, true, somatic filtering in dysgu returns a vcf with only one sample. Weird that the 2nd sample present in the vcf should be necessary. Is the FORMAT information per sample even used by samplot?

It is used, this it the reason this is not working ;) All sample need to be defined in the VCF. In this case you need to add the control sample to your existing VCF. The FORMAT field tells if the variant has been called by a sample or not. Since this is somatic calls only the VCF FROMAT fields should be something like this:

#CHROM  POS     ID      REF     ALT     ...    FORMAT  S30GY_01    SA_GN
chr29   2368050 14371   C       <TRA>   ...    GT:..   0/1:...     0/0:...

Note that for the control the genotype is hom-REF.

I tried to fix it by supplying samplot with "--sample_ids S30GY_01 SA_GN", but same issue. Not sure what to do.

--sample_ids is only for specifying BAM names. Say if you have a different BAM file name than your VCF sample you can use this to map the correct name to the correct BAM file.

warthmann commented 7 months ago

ok. still not sure what information from the FORMAT fields samplot is actually using, and hence why its needed for plotting. Besides the positions, the data that gets plotted is from the bam files. As a workaround I am now using samplot vcf --manual_run and then modify the samplot plot commands to plot both samples.

pontushojer commented 7 months ago

Maybe it was unclear above but the FORMAT/GT field is required by samplot vcf. The GT tells if the variant is present in the sample or not. If not i.e. GT is 0/0, it will still be included (depending a bit on how many samples you have in your VCF, by default for each variant up to 6 samples will be included (see --min_entries and --max_entries)) in the plot by labeled "control".

In you case you can just copy all the calls to a new VCF, rename the sample (SA_GN), change the GT field to 0/0 for all and merge this control VCF with the original.

warthmann commented 7 months ago

Hello, no. You were very clear. samplot vcf plots only samples present in the respective vcf file and apparently reads the FORMAT/GT field for each sample. However, it is unclear to me what this information is used for. To me, the expected behavior was that samplot vcf plots all positions from the vcf for up to 6 samples I am providing a sam/bam file for. For a quick workaround I found it easier to modify the samplot plot commands output by samplot vcf --manual_run. I am now checking upstream. Seems that dysgu has genotyping and merging commands to produce the required vcf with both samples.