vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

Reproducing MHC example #812

Closed tdlong closed 6 years ago

tdlong commented 7 years ago
  1. Difficult to choose msga parameters for large haplotypes
  2. Difficult to get out a standard "vcf" file, summarizing msga graph
  3. How do I name the chromosomes (impacts #2)

Code below

wget http://hgwdev.sdsc.edu/~anovak/hgvm/hgvm.tar.gz
tar xzvf hgvm.tar.gz
cd hgvm/MHC
cat ref.fa G*fa >MHC.fa

vg version
# v1.5.0-404-gf02928f

# https://github.com/vgteam/vg/pull/803
# This worked!!
vg msga -t 32 -f MHC.fa -b ref -w 1024 -B 2 -W 64 -E 3 -J 32 -D >MHC.vg

#  other switches I saw online ...   vg msga  -B 4 -u 16 -W 256 -J 64 >DRB1-3123.vg
#  -J = 32 or 64 shouldn't this be really big eg. 30kb so you can call insertions of transposable elements
# parameter given at https://github.com/vgteam/vg/wiki/Long-read-assemblies-using-vg-msga do not work [-G -S switches do not exist]
# super unclear what the switches are "doing", how should I choose these switch to align haplotypes for a big region.

vg stats -lz MHC.vg
# nodes 957691
# edges 995920
# length    27138087

vg stats -c MHC.vg | wc -l
# 18230

vg paths -L MHC.vg
#GI568335879
#GI568335954
#GI568335976
#GI568335986
#GI568335989
#GI568335992
#GI568335994
#GI568335997
#ref

vg mod -n MHC.vg | vg mod -X 32 - | vg ids -c - >MHC.norm.vg
vg index -x MHC.norm.xg MHC.norm.vg

# This crashes
vg index -g MHC.norm.gcsa -k 11 MHC.norm.vg
# PathGraphBuilder::write(): Size limit exceeded, construction aborted

# This doesn't seem to generate a vcf like anything we have seen before
# typically the file has several lines of information, including definitions of what will be in the sample columns
# I suppose you could leave these out, but that is dangerous
# then the file lists each chromosome or contig  (these are not even really defined in the vg file ... they should be)
# I think these fields are important if you are doing anything with the vcf downstream using standard tools.
# then genotypes for each "polymorphism".  Note after format, there is a column for each sample in the file
# so there should be as many extra columns "vg paths -L MHC.vg | wc -l" returns
vg deconstruct -p ref MHC.norm.vg
#  /var/spool/sge/compute-4-43/job_scripts/2108675: line 17:  7065 Killed                  vg deconstruct -p ref MHC.norm.vg > MHC.norm.vcf

####  example of a typical vcf file
##fileformat=VCFv4.1
##FILTER=<ID=BadValidation,Description="MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)">
[tdlong@hpc-login-1-3 work]$ head -n 5 Q30-SNPs.vcf
##fileformat=VCFv4.1
##FILTER=<ID=BadValidation,Description="MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)">
##FILTER=<ID=FisherStrand,Description="FS > 60">
##FILTER=<ID=InDel,Description="Overlaps a user-input mask">
##FILTER=<ID=LowQual,Description="Low quality">
... stuff deleted
##contig=<ID=chr2L,length=23513712>
##contig=<ID=chr2R,length=25286936>
##contig=<ID=chr3L,length=28110227>
##contig=<ID=chr3R,length=32079331>
##contig=<ID=chr4,length=1348131>
.... stuff deleted
##reference=file:///share/adl/tdlong/DSPR/DNAseq/ref/dm6.fa
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  A4  A5  A6  A7  B2  B3  B6  B7
chr2L   5039    .   C   T   466.57  BadValidation;LowVQCBD  AC=4;AF=1.00;AN=4;DP=288;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=4;MLEAF=1.00;MQ=9.72;MQ0=271;QD=4.81    GT:AD:DP:GQ:PL  ./. 1/1:54,5:55:9:116,9,0   ./. ./. ./. ./. ./. 1/1:26,12:37:30:375,30,0
....stuff deleted
adamnovak commented 6 years ago

I think this is really a few issues, which should be reported separately.

tdlong commented 6 years ago

Sorry, I was just throwing my code up there.

I was trying to use the version of vg we had to construct a graph, and then output the "polymorphisms" in a format akin to a vcf file. I was trying to push for a reproducible real regions for which events can be called.

On Jul 10, 2017, at 11:16 AM, Adam Novak notifications@github.com wrote:

I think this is really a few issues, which should be reported separately.

[ ]vg msga is hard to use and has no good documentation. We say what the options do in the help text, but that doesn't really explain what combination of options one ought to use and why.

The graph that comes out of vg msga can be too complex to GCSA-index with the default parameters. To some extent we can't really fix this; if the graph is so complex it can't be indexed, it needs to be pruned, but maybe we can make the defaults a bit more aggressive with regard to the size limit, and produce a better message like "prune your graph".

The VCFs produced by vg deconstruct don't have all the header lines they ought to have.

The semantics/purpose of vg deconstruct aren't clear. It seems like @tdlong https://github.com/tdlong thinks that vg deconstruct ought to be able to produce a VCF that meaningfully describes a graph generated by vg msga, with one VCF sample per FASTA included in the MSGA run. In fact, deconstruct is not supposed to output samples representing the paths in the graph, and vg deconstruct isn't really ever going to be very useful on graphs produced by vg msga (because they'll contain lots of structures not meaningfully representable in non-breakend VCF).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/812#issuecomment-314190218, or mute the thread https://github.com/notifications/unsubscribe-auth/ATCNNx8IpkQPXXuHi12PEUThhPYMvtkDks5sMmpqgaJpZM4NuiqA.