In your first issue, the reason could be option `--prune-low-cplx`, which you are using with `vg surject` but not when generating the BAM directly with `vg giraffe`.
In the second issue, `--extract M:N` is a developer option in `vg haplotypes`. It is used for extracting ~10 kbp haplotype fragments in a specific subchain (e.g. `--extract 3:4781`). The purpose is to determine how well the kmer-based scores correlate with alignment scores when the fragments are aligned to the assembled sample.
If you want to extract entire paths in FASTA format, you can do it with `vg paths`. For example:

```
vg paths -x graph.gbz -F > output.fa
```
You can use the path selection options if you are only interested in some paths. Note that the synthetic haplotypes generated by `vg haplotypes` get sequence names like `recombination#2#chain_6#0` in the FASTA file, and there is currently no way of mapping the chain identifiers to more meaningful contig names.
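Since the chain identifiers cannot currently be mapped to contig names, one workaround is to rename the headers after extraction. This is only a sketch, assuming `output.fa` came from the `vg paths -F` command above; the new names (`haplotype_1`, `haplotype_2`, ...) are arbitrary labels I chose, and the original name is kept as a description in the header:

```sh
# Assumed input: output.fa produced by `vg paths -F` above, with
# headers like ">recombination#2#chain_6#0". Number the sequences
# sequentially and keep the original name after a space.
awk '/^>/ { printf(">haplotype_%d %s\n", ++n, substr($0, 2)); next } { print }' \
    output.fa > renamed.fa
```

Downstream tools (BWA, GATK) are then working with short, unambiguous contig names while the mapping back to the original chain identifiers stays in the FASTA descriptions.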
Also, if you want to use reference-based workflows (such as mapping with `vg giraffe` in BAM format) with the sampled graph, you must use option `--include-reference` in the `vg haplotypes` command that generates the sampled graph. In that case, you should use the actual reference with external tools.
On the other hand, if you want to generate a sample-specific reference, extract it in FASTA format, and use it in other tools, something like this should work:
```
# Generate a subgraph with the reference and a single consensus haplotype based on kmer counts
vg haplotypes -t 24 -v 1 --include-reference --num-haplotypes 1 -i graph.hapl -k sample.kff -g sample.gbz graph.gbz

# Extract the generated haplotype but not the reference
vg paths -x sample.gbz -F -S recombination > sample.fa
```
Graph `sample.gbz` will then represent the alignment between the actual reference you may have and the synthetic reference you just generated.
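After extracting the synthetic reference, a quick sanity check on the FASTA is useful before taking it downstream. This is a small sketch (assuming `sample.fa` from the `vg paths` command above) that reports the number of sequences and the total bases; if the total is far above the expected genome size, the sampling may have gone wrong:

```sh
# Count sequences and total bases in sample.fa (assumed to be the
# FASTA extracted by `vg paths -F -S recombination` above).
awk '/^>/ { seqs++; next } { bases += length($0) }
     END { printf("sequences: %d\ntotal bases: %d\n", seqs, bases) }' sample.fa
```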
@jltsiren Hi, first of all, thank you very much for answering my question so quickly; it seems to work!
I got the FASTA file I wanted when I used `vg paths` as you suggested, so I took it downstream for analysis:
I used sample.fa as the genome and performed ANI and SNP analyses against the 31 genomes that made up the reference pangenome graph (in microbiology, an ANI of 95% or above generally indicates the same species). According to the ANI results, sample.fa matches our expectations. I then called SNPs with a conventional aligner (BWA) and related software, using three references: the Bacillus subtilis reference genome 168 (not included in the .gbz), the sample BEST7003 (the closest by ANI), and sample.fa itself. Simple SNP statistics are as follows. The number of SNPs when cross-referencing sample.fa against 168 confused me, so I also ran a whole-genome collinearity analysis:
You can see the following information:
Based on the above information, my questions are:
1. Why is sample.fa much larger than a single genome (in the microbial context)? Can we understand sample.fa as the genome backbone in the .gbz that is closest to the input FASTQ data, plus fragmented information from the other genomes? The key point is this: since the .kff file is specific to the sample FASTQ, I would expect sample.fa to be sample-specific as well. But as the collinearity results above show, there is a lot of fragmented information, and there seems to be overlap between large syntenic fragments and the 168 genome. I do not know whether this counts as redundant genome information, which might explain why sample.fa is relatively large (11 Mb).
2. How should the SNP results for sample.fa be interpreted? Can they be regarded as follows: on the basis of the n FASTA genomes that make up the pangenome graph, find the simulated genome (or graph path) that is closest to the FASTQ data (since it is not composed of a single real FASTA), and compute the SNP differences between that path and the input FASTQ data? If it can be understood this way, how should the size of sample.fa be explained? After all, as a simulated genome, sample.fa is much larger than a conventional genome.
Thank you again for your accurate and helpful reply; I look forward to your answers to my new questions!
Our haplotype sampling approach has been designed for human pangenome graphs built with Minigraph-Cactus. It assumes that the high-level graph structure is linear. The algorithm finds nodes each haplotype passes through in the same order, and selects a subset where the distance between nodes is ~10 kbp over the shortest path. If the assumption about linearity holds, the distance should also be ~10 kbp over almost any haplotype. The algorithm selects the closest haplotypes independently in each of those ~10 kbp intervals, using kmers that are specific to the graph region, and combines them in an arbitrary way to form synthetic haplotypes.
Because you are using a PGGB graph for another species, the expected linear structure may not exist in the graph. In that case, many things can go wrong. In particular, the sampled sequences may be much longer than the real genome if there are major rearrangements in the graph.
Hi Jouni,
I also use `vg haplotypes` to generate a personalized reference genome. It's a powerful tool! But I am wondering which haplotype is the best one if I sample multiple haplotypes. In my results, the last haplotype always has the highest genotyping concordance with the consensus genotype. Does this mean that the last haplotype has the highest score?
@hahahafeifeifei The first haplotype is supposed to be the best approximation of the consensus sequence. The selection of subsequent haplotypes shifts the focus away from the consensus towards covering kmers that should be included but were not covered by earlier haplotypes.
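Since the first haplotype is the best approximation of the consensus, one may want to keep only that haplotype from the extracted FASTA. A header-prefix filter is one option. This sketch assumes the PanSN-style names shown earlier (`recombination#<haplotype>#<chain>#0`), with the haplotype number as the second `#`-separated field; I have not verified that numbering starts at 1, so check your own headers first:

```sh
# Keep only records whose header has "1" in the second '#'-separated
# field (assumed to be the haplotype number), e.g.
# ">recombination#1#chain_3#0". Sequence lines inherit the decision
# made at the most recent header.
awk -F'#' '/^>/ { keep = ($2 == "1") } keep' sample.fa > first_haplotype.fa
```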
1. What were you trying to do?
I intend to use pggb to build a pangenome reference graph from 31 strains of Bacillus subtilis, use vg to obtain BAM files similar to those from BWA, and then use traditional variant-calling software such as GATK and FreeBayes to analyze the variants.
2. What did you want to happen?
I want to obtain the sample-personalized reference genome in FASTA format as suggested in the tutorial, and after using vg to produce the corresponding BAM file, use traditional variant-calling software for analysis.
3. What actually happened?
Here are two issues that confuse me:
Issue 1: generating the BAM directly

```
vg giraffe -t 32 -Z /pggb/test/out/bs31.giraffe.gbz \
    -m /pggb/test/out/bs31.min -d /pggb/test/out/bs31.dist \
    -f /pan_genome/snp/168/fastq/SRR11461738/SRR11461738_1.fastq \
    -f /pan_genome/snp/168/fastq/SRR11461738/SRR11461738_2.fastq \
    -p -o BAM > /pggb/judge/50/bs31_really.bam
```
```
$ ls -lah
787M Aug 12 00:38 bs31.bam         # generated from .gam
2.9G Aug 11 15:38 bs31.gam
865M Aug 12 09:34 bs31_really.bam  # generated directly as BAM
```
```
warning[vg::Watchdog]: Thread 24 has been checked in for 10 seconds processing: SRR11461738.5808189, SRR11461738.5808189
warning[vg::Watchdog]: Thread 0 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5801260, SRR11461738.5801260
warning[vg::Watchdog]: Thread 18 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5790670, SRR11461738.5790670
warning[vg::Watchdog]: Thread 24 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5808189, SRR11461738.5808189
warning[vg::Watchdog]: Thread 10 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5804810, SRR11461738.5804810
warning[vg::Watchdog]: Thread 17 has been checked in for 10 seconds processing: SRR11461738.5809221, SRR11461738.5809221
warning[vg::Watchdog]: Thread 23 has been checked in for 10 seconds processing: SRR11461738.5794261, SRR11461738.5794261
warning[vg::Watchdog]: Thread 26 has been checked in for 10 seconds processing: SRR11461738.5791155, SRR11461738.5791155
```

```
$ cat 536353.out | grep "Watchdog" | wc -l
140822
```
Issue 2: extracting with `--extract M:N` fails

```
vg haplotypes -t 24 -v 1 -i /pggb/judge/test.hapl \
    -k /judge/testkmc.kff --extract M:N \
    /judge/bs31sample.gbz > /pggb/judge/testref.fa
```

```
error: could not parse m from argument "M"
```
```
vg gbwt -p --num-threads 16 -r /pggb/test/out/bs31.ri \
    -Z /pggb/test/out/bs31.giraffe.gbz

vg haplotypes -t 24 -v1 -H /pggb/judge/test.hapl \
    -d /pggb/test/out/bs31.dist \
    -r /pggb/test/out/bs31.ri /pggb/test/out/bs31.giraffe.gbz
```
Generate the .kff file:

```
$ cat kmc.txt
/pan_genome/snp/168/fastq/SRR11461738/SRR11461738_1.fastq
/pan_genome/snp/168/fastq/SRR11461738/SRR11461738_2.fastq

kmc -k29 -m128 -okff -t16 -hp @/pggb/judge/kmc.txt \
    /pggb/judge/testkmc /pggb/judge
```
```
vg haplotypes -t 24 -v 1 -i /pggb/judge/test.hapl \
    -k /pggb/judge/testkmc.kff \
    -g /pggb/judge/bs31sample.gbz \
    /pggb/test/out/bs31.giraffe.gbz

vg haplotypes -t 24 -v 1 -i /pggb/judge/test.hapl \
    -k /pggb/judge/testkmc.kff --extract M:N \
    /pggb/judge/bs31sample.gbz > /pggb/judge/testref.fa
```