vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

About the error of vg haplotypes and the bam file size of vg surject #4051

Closed luobosi closed 1 year ago

luobosi commented 1 year ago

1. What were you trying to do?

I intend to use pggb to build a pan-reference genome graph from 31 strains of Bacillus subtilis, use vg to obtain BAM files similar to those produced by BWA, and then analyze the variants with traditional variant-calling software such as GATK and freebayes.

2. What did you want to happen?

I want to get a sample-personalized reference genome in FASTA format as suggested in the tutorial, and, after using vg to get the corresponding BAM file, use traditional variant-calling software for analysis.

3. What actually happened?

Here are two issues that confuse me:

  1. When I align with vg giraffe, I first produce a GAM file and then convert it to BAM with vg surject. Compared with having giraffe write BAM output directly, the two files differ noticeably in size, so I don't know which BAM is more reasonable to use for variant calling?
    
    vg giraffe -t 32 -Z /pggb/test/out/bs31.giraffe.gbz \
    -m /pggb/test/out/bs31.min -d /pggb/test/out/bs31.dist \
    -f /pan_genome/snp/168/fastq/SRR11461738/SRR11461738_1.fastq \
    -f /pan_genome/snp/168/fastq/SRR11461738/SRR11461738_2.fastq \
    -p  >/pggb/judge/50/bs31.gam
    vg surject \
      -A -C 0 --bam-output --max-frag-len 3000 --prune-low-cplx \
      --xg-name /pggb/test/out/bs31.giraffe.gbz \
      -t 32 --sample Bs31 \
      /pggb/judge/50/bs31.gam > /pggb/judge/50/bs31.bam

Generate BAM directly:

vg giraffe -t 32 -Z /pggb/test/out/bs31.giraffe.gbz \
  -m /pggb/test/out/bs31.min -d /pggb/test/out/bs31.dist \
  -f /pan_genome/snp/168/fastq/SRR11461738/SRR11461738_1.fastq \
  -f /pan_genome/snp/168/fastq/SRR11461738/SRR11461738_2.fastq \
  -p -o BAM >/pggb/judge/50/bs31_really.bam

$ ls -lah
787M Aug 12 00:38 bs31.bam         # generated from the .gam
2.9G Aug 11 15:38 bs31.gam
865M Aug 12 09:34 bs31_really.bam  # BAM generated directly
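The size gap can be checked with quick shell arithmetic (values copied from the listing above, in MB):

```shell
# Difference between the two reported BAM sizes (MB values from ls -lah above).
direct_bam_mb=865      # bs31_really.bam, BAM written directly by giraffe
surjected_bam_mb=787   # bs31.bam, surjected from the GAM
diff_mb=$(( direct_bam_mb - surjected_bam_mb ))
echo "${diff_mb} MB"
```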

While generating the BAM files there were many Watchdog warnings, but no errors aborted the run. Still, even allowing for algorithmic differences, such a large gap (**nearly 78 MB**) between BAM files produced from the same data seems problematic:

warning[vg::Watchdog]: Thread 24 has been checked in for 10 seconds processing: SRR11461738.5808189, SRR11461738.5808189
warning[vg::Watchdog]: Thread 0 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5801260, SRR11461738.5801260
warning[vg::Watchdog]: Thread 18 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5790670, SRR11461738.5790670
warning[vg::Watchdog]: Thread 24 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5808189, SRR11461738.5808189
warning[vg::Watchdog]: Thread 10 finally checked out after 11 seconds and 0 kb memory growth processing: SRR11461738.5804810, SRR11461738.5804810
warning[vg::Watchdog]: Thread 17 has been checked in for 10 seconds processing: SRR11461738.5809221, SRR11461738.5809221
warning[vg::Watchdog]: Thread 23 has been checked in for 10 seconds processing: SRR11461738.5794261, SRR11461738.5794261
warning[vg::Watchdog]: Thread 26 has been checked in for 10 seconds processing: SRR11461738.5791155, SRR11461738.5791155

$ cat 536353.out | grep "Watchdog" | wc -l
140822

2. My second question is related to **Haplotype Sampling**. I followed the instructions in the [Haplotype Sampling wiki](https://github.com/vgteam/vg/wiki/Haplotype-Sampling) to generate the GBZ file and the required indexes (ri, dist, hapl, kff). When I use the **--extract M:N** option, an error occurs:

vg haplotypes -t 24 -v 1 -i /pggb/judge/test.hapl \
  -k /judge/testkmc.kff --extract M:N \
  /judge/bs31sample.gbz > /pggb/judge/testref.fa

error: could not parse m from argument "M"

I want to know where I went wrong. As far as I know, although my strain data is from a **haploid genome**, tools such as **samtools and GATK** handle microbial data well in diploid mode. Through this step I want to **obtain a sample-specific reference genome in FASTA format** from the pan-reference genome, to compensate for the reference-bias problem.
Forgive me for not being able to locate the exact source right now, but I remember that in your paper on the human pangenome reference draft, when evaluating variant calling, you mentioned that **traditional variant-calling software (such as freebayes)**, run on BAM or FASTA generated by vg, **can achieve better results** than a linear reference genome. So I would like to know: did you call variants using the vg-generated BAM against the linear reference genome, or against the haplotypes.fna generated here?

Maybe the description above is a bit vague, so I am attaching a hand-drawn concept map of my problem:
![hand_drawn](https://github.com/vgteam/vg/assets/96641342/c013de82-f342-4904-a5d7-ec29f118e731)
I would like to know whether it is possible to use the BAM generated by giraffe as input and **call variants with software such as GATK**, and whether the reference genome should be the traditional reference genome or the **haplotype-sampled sequence**.

**4. What data and command can the vg dev team use to make the problem happen?**

Commands related to the haplotypes problem:

vg gbwt -p --num-threads 16 -r /pggb/test/out/bs31.ri \
  -Z /pggb/test/out/bs31.giraffe.gbz

vg haplotypes -t 24 -v 1 -H /pggb/judge/test.hapl \
  -d /pggb/test/out/bs31.dist \
  -r /pggb/test/out/bs31.ri /pggb/test/out/bs31.giraffe.gbz

Generate the .kff file:

$ cat kmc.txt
/pan_genome/snp/168/fastq/SRR11461738/SRR11461738_1.fastq
/pan_genome/snp/168/fastq/SRR11461738/SRR11461738_2.fastq

kmc -k29 -m128 -okff -t16 -hp @/pggb/judge/kmc.txt \
  /pggb/judge/testkmc /pggb/judge

vg haplotypes -t 24 -v 1 -i /pggb/judge/test.hapl \
  -k /pggb/judge/testkmc.kff \
  -g /pggb/judge/bs31sample.gbz \
  /pggb/test/out/bs31.giraffe.gbz

vg haplotypes -t 24 -v 1 -i /pggb/judge/test.hapl \
  -k /pggb/judge/testkmc.kff --extract M:N \
  /pggb/judge/bs31sample.gbz > /pggb/judge/testref.fa


I have uploaded the .gfa, .gbz, fastq, dist, ri, etc. files to a [public repository](https://github.com/luobosi/vg_variant_data.git) so the issue can be reproduced. Please let me know if any other files are required.

**5. What does running vg version say?**

vg version v1.50.1 "Monopoli"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@emerald
jltsiren commented 1 year ago

In your first issue, the reason could be option --prune-low-cplx, which you are using with vg surject but not when generating BAM directly with vg giraffe.

In the second issue, --extract M:N is a developer option in vg haplotypes. It's used for extracting ~10 kbp haplotype fragments in a specific subchain (e.g. --extract 3:4781). The purpose is determining how well the kmer-based scores correlate with alignment scores if you align the fragments to the assembled sample.
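The parse error in the original post can be seen in miniature: the option wants two literal integers joined by a colon, so the wiki's "M:N" placeholder cannot be parsed. A shell sketch of splitting such an `m:n` argument (hypothetical values, not vg code):

```shell
# --extract expects integers in the form m:n (e.g. 3:4781);
# the literal placeholder "M:N" cannot be parsed as a number.
arg="3:4781"   # hypothetical subchain argument
m=${arg%%:*}   # text before the colon
n=${arg##*:}   # text after the colon
echo "$m $n"
```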

If you want to extract entire paths in FASTA format, you can do it with vg paths. For example:

vg paths -x graph.gbz -F > output.fa

You can use the path selection options if you are only interested in some paths. Note that the synthetic haplotypes generated by vg haplotypes get sequence names like recombination#2#chain_6#0 in the FASTA file, and there is currently no way of mapping the chain identifiers to more meaningful contig names.
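If all paths have already been dumped to FASTA, the synthetic haplotypes can still be pulled out afterwards by their name prefix; a plain awk sketch over a made-up two-record file (with vg itself, the `-S` path-selection option is simpler):

```shell
# Keep only FASTA records whose names start with "recombination"
# (toy input file; names mimic the recombination#2#chain_6#0 pattern).
cat > all_paths.fa <<'EOF'
>ref#0#chr1
ACGT
>recombination#2#chain_6#0
TTTT
EOF
awk '/^>/ {keep = ($0 ~ /^>recombination/)} keep' all_paths.fa
```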

Also, if you want to use reference-based workflows (such as mapping with vg giraffe in BAM format) with the sampled graph, you must use option --include-reference in the vg haplotypes command that generates the sampled graph. In that case, you should use the actual reference with external tools.

On the other hand, if you want to generate a sample-specific reference, extract it in FASTA format, and use it in other tools, something like this should work:

# Generate a subgraph with the reference and a single consensus haplotype based on kmer counts
vg haplotypes -t 24 -v 1 --include-reference --num-haplotypes 1 -i graph.hapl -k sample.kff -g sample.gbz graph.gbz

# Extract the generated haplotype but not the reference
vg paths -x sample.gbz -F -S recombination > sample.fa

Graph sample.gbz will then represent the alignment between the actual reference you may have and the synthetic reference you just generated.

luobosi commented 1 year ago

@jltsiren Hi, first of all thank you very much for answering my question quickly and it seems to me that it works!

I did get the fasta file I wanted when I used vg paths as you suggested, so I took it downstream for analysis:

I used sample.fa as the genome and performed quantitative ANI and SNP analyses against the 31 genomes used to build the reference graph (for microorganisms, an ANI of 95% or above is generally considered the same species level):

[image: 31_sample_ani]

According to the ANI results, sample.fa is in line with our expectations. I then called SNPs with the conventional aligner BWA and related software, using three references: the Bacillus subtilis reference genome 168 (this genome is not in the .gbz), BEST7003 (the closest sample by ANI), and sample.fa itself. Simple SNP statistics are as follows:

[image: all_snp]
[image: 168_sample_snp]

The number of SNPs when sample.fa is cross-compared with 168 confuses me, so I ran a genome-wide collinearity analysis:

[image: collinearity_sample_168]

You can see the following information:

  1. The length of sample.fa is much greater than a conventional whole-genome length (11 Mb vs. 4.2 Mb);
  2. When aligned against 168, the first line of sample.fa, as well as the first and second parts of the second line, overlap substantially with the 168 genome.

Based on the above information, my questions are:

  1. Why is sample.fa so much larger than a single genome (in a microbial context)? Can we understand sample.fa as the genome backbone in the .gbz that is closest to the input fastq data, plus fragmented information from other genomes? The key point is: since the .kff file we use is specific to the sample's fastq, I assumed sample.fa would also be sample-specific. But as the collinearity results above show, why is there so much fragmentation, and why do large fragments appear to overlap the 168 genome? (I don't know whether this counts as redundant genome information, which might explain why sample.fa is relatively large at 11 Mb.)

  2. How should SNPs and other variants called against sample.fa be interpreted? Can they be regarded as follows: among the n FASTA genomes that make up the pangenome graph, find the simulated genome or path that is closest to the fastq data (since it is not composed of a single real FASTA), and compute the SNP differences between that path and the input fastq data? If so, how should the size of sample.fa be explained? After all, as a simulated genome, sample.fa is much larger than a conventional genome.

Thank you again for your accurate and effective reply, and look forward to your answer to my new question!

jltsiren commented 1 year ago

Our haplotype sampling approach was designed for human pangenome graphs built with Minigraph-Cactus. It assumes that the high-level graph structure is linear. The algorithm finds nodes that every haplotype passes through in the same order, and selects a subset where the distance between consecutive nodes is ~10 kbp over the shortest path. If the linearity assumption holds, the distance should also be ~10 kbp over almost any haplotype. The algorithm then selects the closest haplotypes independently in each of those ~10 kbp intervals, using kmers specific to that graph region, and combines the selections in an arbitrary way to form synthetic haplotypes.
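The per-interval selection described above can be caricatured in a few lines of shell (a toy illustration only, not vg's implementation; interval names, haplotype names, and scores are invented):

```shell
# For each interval, keep the haplotype with the highest kmer score,
# then join the per-interval winners into one synthetic path.
cat > scores.tsv <<'EOF'
interval1 hapA 5
interval1 hapB 9
interval2 hapA 7
interval2 hapB 2
EOF
awk '$3 > best[$1] {best[$1] = $3; pick[$1] = $2}
     END {printf "%s,%s\n", pick["interval1"], pick["interval2"]}' scores.tsv
```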

Because you are using a PGGB graph for another species, the expected linear structure may not exist in the graph. In that case, many things can go wrong. In particular, the sampled sequences may be much longer than the real genome if there are major rearrangements in the graph.

hahahafeifeifei commented 1 year ago

Hi Jouni,

I also use vg haplotypes to generate a personalized reference genome. It's a powerful tool! But I'm wondering which haplotype is the best one if I sample multiple haplotypes. In my results, the last haplotype always has the highest genotyping concordance with the consensus genotype. Does this mean that the last haplotype has the highest score?

jltsiren commented 1 year ago

@hahahafeifeifei The first haplotype is supposed to be the best approximation of the consensus sequence. The selection of subsequent haplotypes shifts the focus away from the consensus towards covering kmers that should be included but were not covered by earlier haplotypes.