ryanlayer / samplot

Plot structural variant signals from many BAMs and CRAMs
MIT License
507 stars 67 forks source link

Trouble generating images from a VCF file: Doesn't work on my data or sample data #101

Closed moldach closed 4 years ago

moldach commented 4 years ago

I'm able to run the samplot.py on the example data and on my own data but am having issues when I try the samplot_vcf.py script.

As far as I know there is no Repeat Masker track for C. elegans (at least not on UCSC) so the --important_regions regions.bed has been omitted.

python ~/bin/samplot/src/samplot_vcf.py \
    --vcf Pindel.vcf\
    -d test/\
    -O png\
    -b 470.sorted.dedupped.bam > samplot_commands.sh

[W::bcf_hdr_check_sanity] PL should be declared as Number=G

I see an empty samplot_commands.sh file and when I go to open the index.html I see zero records.

Because this did not work I want to validate on the example data; however, I get similar results using all of the .bam files provided with samplot:

cd ~/bin/samplot/test/data/
python ~/bin/samplot/src/samplot_vcf.py --vcf test.vcf -d vcf-test/ -O png -b NA12878_restricted.bam > samplot_commands.sh

Your help would be greatly appreciated

jbelyeu commented 4 years ago

Hi, Matthew, sorry about this problem. There is a bug in that version of the samplot_vcf code and we're in the middle of a version transition. Could you try installing via conda instead? This will give you a different version of samplot in which that problem has been fixed. It will also slightly alter the way samplot is invoked.

To install with conda, just run "conda install samplot"

To run the samplot vcf command: "samplot vcf" instead of "python samplot_vcf.py".

Please let me know how that works for you! Happy to assist further.

On Tue, Apr 14, 2020, 20:27 Matthew J. Oldach notifications@github.com wrote:

I'm able to run the samplot.py on the example data and on my own data but am having issues when I try the samplot_vcf.py script.

As far as I know there is no Repeat Masker track for C. elegans (at least not on UCSC) so the --important_regions regions.bed has been omitted.

python ~/bin/samplot/src/samplot_vcf.py \ --vcf Pindel.vcf\ -d test/\ -O png\ -b 470.sorted.dedupped.bam > samplot_commands.sh

[W::bcf_hdr_check_sanity] PL should be declared as Number=G

I see an empty samplot_commands.sh file and when I go to open the index.html I see zero records.

Because this did not work I want to validate on the example data; however, I get similar results using all of the .bam files provided with samplot:

cd ~/bin/samplot/test/data/ python ~/bin/samplot/src/samplot_vcf.py --vcf test.vcf -d vcf-test/ -O png -b NA12878_restricted.bam > samplot_commands.sh

Your help would be greatly appreciated

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ryanlayer/samplot/issues/101, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSSNRYAEA6XFJIVQUSLBY3RMULPXANCNFSM4MIGVLLQ .

moldach commented 4 years ago

To install with conda, just run "conda install samplot". To run the samplot vcf command: "samplot vcf" instead of "python samplot_vcf.py".

I've installed viz conda and run into the same issue (both on my data and test data from samplot repo)

samplot vcf \
    --vcf Delly.vcf\
    -d test/\
    -O png\
    -b 470.sorted.dedupped.bam > samplot_commands.sh
cd ~/bin/samplot/test/data
samplot vcf \
    --vcf test.vcf\
    -d test/\
    -O png\
    -b NA12878_restricted.bam > samplot_commands.sh
jbelyeu commented 4 years ago

My bad, I forgot to mention an additional change. The default behavior is now to autorun instead of printing out the commands. Did it create the images?

You can also specify that you want to run manually by using the --manual_run option. This will write the commands to samplot_vcf_cmds.tmp (and you can change the command file name using the --command_file option.

Again, my apologies

moldach commented 4 years ago

Hi @jbelyeu

So when you say autorun instead of printing out the commands I assume you meant to omit the > samplot_commands.sh part?

samplot vcf \
    --manual_run \
    --vcf Delly.vcf\
    -d test/\
    -O png\
    -b 470.sorted.dedupped.bam

The output directory test/ only contain index.html and when I open this firefox index.html it says

No data available in table. Showing 1-0 of 0 records

jbelyeu commented 4 years ago

Yes, and I'm not sure what the issue is now. Can you cd into the samplot directory and run the tests (bash runtests.sh) and see if PNGs are created in test/func/test_vcf_auto_dir/?

Also check that the version is 1.0.12 (samplot -v)

moldach commented 4 years ago

Hi @jbelyeu okay so bash runtests.sh creates PNGs and peering into the source code I also tried the following:

samplot vcf \
    -d test/ \
    --vcf ~/bin/samplot-master/test/data/test.vcf \
    --sample_ids HG002 \
    -b ~/bin/samplot-master/test/data/HG002_Illumina.bam \
    --manual_run \
    --command_file test_samplot.sh

bash test_samplot.sh
firefox test/index.html

This also works. Now I'll try it on my data:

Pindel

samplot vcf \
    -d YOLO/ \
    --vcf Pindel.vcf \
    --sample_ids maddog \
    -b  470.sorted.dedupped.bam \
    --manual_run \
    --command_file test_samplot.sh

Here I don't see anything in test_samplot.sh or index.html. Luckily I have a number of (non-empty) call sets we can troubleshoot with: CNVnator, Delly, GRIDSS, Lumpy, Manta, MindTheGap, NGSep, Pindel, Tardis.

NGSep

(base) mtg@pop-os:~/MADDOG$ samplot vcf \
    -d YOLO/ \
    --vcf NGSep.vcf \
    --sample_ids maddog  \
   -b  470.sorted.dedupped.bam  \
   --manual_run  \
   --command_file test_samplot.sh

Traceback (most recent call last):
  File "/home/mtg/anaconda3/bin/samplot", line 10, in <module>
    sys.exit(main())
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/__main__.py", line 31, in main
    args.func(parser)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot_vcf.py", line 444, in vcf
    svtype = variant.info.get("SVTYPE", "SV")
  File "pysam/libcbcf.pyx", line 2676, in pysam.libcbcf.VariantRecordInfo.get
ValueError: Invalid header

Lumpy

Trying Lumpy i see the following:

(base) mtg@pop-os:~/MADDOG$ samplot vcf  \
   -d YOLO/  \
   --vcf Lumpy.vcf \
   --sample_ids maddog \
   -b  470.sorted.dedupped.bam \
   --manual_run  \
   --command_file test_samplot.sh

[W::vcf_parse] Contig 'I' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'II' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'III' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'IV' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'V' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'X' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'MtDNA' is not defined in the header. (Quick workaround: index the file with tabix.)

Okay let's follow the prompt:

(base) mtg@pop-os:~/MADDOG$ bgzip -c Lumpy.vcf > Lumpy.vcf.gz
(base) mtg@pop-os:~/MADDOG$ tabix -p vcf Lumpy.vcf.gz 
[E::hts_idx_push] Unsorted positions on sequence #1: 670648 followed by 230637
tbx_index_build failed: Lumpy.vcf.gz

All of the other call sets produce either no-output like Pindel or the error like NGSep.

jbelyeu commented 4 years ago

Ok, well, that narrows it down to data-specific problems. Samplot vcf doesn't plot small events by default (for more details run samplot vcf -h and check out the --min_bp argument), but if the variants in your pindel file are larger than 20 bp that shouldn't be an issue.

The other vcfs have different isssues. Lumpy.vcf probably just needs to be sorted before indexing (check this out: https://www.biostars.org/p/299659/). I'm not familiar with ngsep personally, but it appears not to include the SVTYPE field, which samplot vcf requires.

jbelyeu commented 4 years ago

Is sharing a subset of your pindel data an option? If so I'd be happy to try to reproduce the issue myself

moldach commented 4 years ago

Ok, well, that narrows it down to data-specific problems. Samplot vcf doesn't plot small events by default (for more details run samplot vcf -h and check out the --min_bp argument), but if the variants in your pindel file are larger than 20 bp that shouldn't be an issue.

There are SVs larger than 20 bp in these files.

The other vcfs have different isssues. Lumpy.vcf probably just needs to be sorted before indexing (check this out: https://www.biostars.org/p/299659/). I'm not familiar with ngsep personally, but it appears not to include the SVTYPE field, which samplot vcf requires.

Earlier I ran into similar problems for the .gff3 annotations; I got errors about needing to sort and index

samplot plot \
    -n MADDOG \
    -b 470.sorted.dedupped.bam \
    -o chrI_464950_465071_genes.png \
    -c chrI \
    -s 464950 \
    -e 465071 \
    -t DEL \
    -T Caenorhabditis_elegans.WBcel235.99.gff3.gz

Window size is under 1.5x the estimated fragment length and will be resized to 502. Rerun with -w 60 to override
Traceback (most recent call last):
  File "/home/mtg/anaconda3/bin/samplot", line 10, in <module>
    sys.exit(main())
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/__main__.py", line 31, in main
    args.func(parser)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 3456, in plot
    options.xaxis_label_fontsize,
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 3244, in plot_transcript
    transcript_plan = get_transcript_plan(ranges, transcript_file)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 3147, in get_transcript_plan
    itr = get_tabix_iter(r.chrm, r.start, r.end, transcript_file)
  File "/home/mtg/anaconda3/lib/python3.7/site-packages/samplot/samplot.py", line 183, in get_tabix_iter
    tbx = pysam.TabixFile(datafile)
  File "pysam/libctabix.pyx", line 351, in pysam.libctabix.TabixFile.__cinit__
  File "pysam/libctabix.pyx", line 383, in pysam.libctabix.TabixFile._open
OSError: index `Caenorhabditis_elegans.WBcel235.99.gff3.gz.tbi` not found

After sorting and indexing the samplot plot code above works:

bedtools sort -i Caenorhabditis_elegans.WBcel235.99.gff3.gz | bgzip -c > Caenorhabditis_elegans.WBcel235.99.sorted.gff3.gz
samplot plot \
    -n MADDOG \
    -b 470.sorted.dedupped.bam \
    -o chrI_464950_465071_genes.png \
    -c chrI \
    -s 464950 \
    -e 465071 \
    -t DEL \
    -T Caenorhabditis_elegans.WBcel235.99.sorted.gff3.gz

However, following the biostars advice to sort/index the .vcf did not solve the issue:

cat Lumpy.vcf  | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' > Lumpy_sorted.vcf
bgzip -c Lumpy_sorted.vcf > Lumpy_sorted.vcf.gz
tabix -p vcf Lumpy_sorted.vcf.gz 

samplot vcf  \
   -d YOLO/  \
   --vcf Lumpy_sorted.vcf.gz \
   --sample_ids maddog \
   -b  470.sorted.dedupped.bam \
   --manual_run  \
   --command_file test_samplot.sh
jbelyeu commented 4 years ago

What error did that last command generate?

moldach commented 4 years ago

This is a tricky one that passes with no error message

jbelyeu commented 4 years ago

As I understand it, the first issue with the lumpy data was a failure to read the vcf, which threw an error. Looks like the sort/index steps solved that and we're back to the original problem that also occurred with the pindel vcf. That's what I call progress!

Another thing you might check is that there's an exact match between the maddog sample id and the column name in the vcf

moldach commented 4 years ago

As I understand it, the first issue with the lumpy data was a failure to read the vcf, which threw an error. Looks like the sort/index steps solved that and we're back to the original problem that also occurred with the pindel vcf. That's what I call progress!

Science typically progresses funeral by funeral so were on a better track!

Another thing you might check is that there's an exact match between the maddog sample id and the column name in the vcf

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT maddog

Looks like it's the same name in the column as was used in the sample-id.

What's your e-mail? I have permission to share the Pindel.vcf. I can share a dropbox link with you.

jbelyeu commented 4 years ago

jrbelyeu@gmail.com

I'll also need part of the alignment file. You can use samtools to extract a region (samtools view -hb 470.sorted.dedupped.bam {}:{}-{}> partial.bam). I'll just need enough of the bam to fully encompass a variant or two that should be plotted.

jbelyeu commented 4 years ago

Ok, the issue is that in this Pindel vcf, all of the the genotype for this sample, for all variants, is homref. Samplot vcf only plots when it finds a sample recorded as het/homalt. Is this also the case for the Lumpy vcf? You may want to look into https://github.com/hall-lab/svtyper