jonassibbesen / rpvg

Method for inferring path posterior probabilities and abundances from pangenome graph read alignments
MIT License
47 stars 6 forks source link

Assertion `hasNodeId(node_id)' failed #43

Open wtulle opened 1 year ago

wtulle commented 1 year ago

Hi, I need help with an error running rpvg command. I'm getting "paths_index.cpp:73: uint32_t PathsIndex::nodeLength(uint32_t) const: Assertion `hasNodeId(node_id)' failed" error, but I have no idea what that means. The full command is: rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -a mpmap.gamp -o rpvg --inference-model haplotypes-transcripts

Any idea of what the problem could come from, the .xg, .gbwt or .gamp parameter? Thanks!

jonassibbesen commented 1 year ago

Hi, from the error it looks like the pantranscriptome does not match the graph that was given as input (ref.xg). Could you share the command lines you used when running vg rna and vg mpmap? Thanks!

wtulle commented 1 year ago

Hi, here are the full commands I used:

bgzip -c contigs.vcf > contigs.vcf.gz && tabix contigs.vcf.gz vg construct -r ref.fa -v contigs.vcf.gz -a > ref.vg vg index -x ref.xg -T ref.vg vg autoindex --workflow mpmap -t 2 --prefix vg_rna --ref-fasta ref.fa --vcf contigs.vcf.gz --tx-gff kortha.gtf

vg mpmap -t 2 -x vg_rna.spliced.xg -g vg_rna.spliced.gcsa -d vg_rna.spliced.dist -f r1.fq.gz -f r2.fq.gz > mpmap.gamp vg rna -n kortha.gtf ref.vg -i pantranscriptome.txt -b pantranscriptome.gbwt > pantranscriptome.txt.gz rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts

Thanks!

jeizenga commented 1 year ago

To co-generate the pantranscriptome and the spliced graph/indexes, you should use vg autoindex --workflow mpmap --workflow rpvg. That should guarantee that all of the indexes are consistent with each other. In general, I would recommend against mixing indexes from the automatic vg autoindex pipeline and manual pipelines, which could use different construction parameters.

wtulle commented 1 year ago

Sorry to bother you again, I'm trying your answer but now I'm getting another error I can't resolve: vg autoindex --workflow mpmap --workflow rpvg --prefix vg_rna --ref-fasta ref.fa --vcf contigs.vcf.gz --tx-gff kortha.gff

The output: [IndexRegistry]: Checking for phasing in VCF(s). error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA VCF are insufficient to create target index Haplotype-Transcript GBWT

I've tried adding the parameter --gfa ref.gfa with similar result:

[IndexRegistry]: Checking for phasing in VCF(s). [IndexRegistry]: Provided: VCF [IndexRegistry]: Checking for haplotype lines in GFA. [IndexRegistry]: Provided: Reference GFA error:[vg autoindex] Input is not sufficient to create indexes

jeizenga commented 1 year ago

The issue is that vg autoindex isn't finding phased variants in the VCF, which are required to form the haplotype-specific transcripts in the pantranscriptome. Is your VCF un-phased?

ld9866 commented 1 year ago

The issue is that vg autoindex isn't finding phased variants in the VCF, which are required to form the haplotype-specific transcripts in the pantranscriptome. Is your VCF un-phased? Hello! Have you solved the problem? I had the same problem

jeizenga commented 1 year ago

Have you already checked for phasing in your VCF?

ld9866 commented 1 year ago

Hello! I would like to ask how to "phasing in your VCF", do you mean that this vcf needs to use bcftools for quality control?

jeizenga commented 1 year ago

VCFs can express either phased or unphased genotypes. Phased genotypes link together the alleles at multiple loci as occurring on the same haplotype. Unphased genotypes simply assert the alleles at each locus without specifying what combination of alleles co-occur on each haplotype. The pantranscriptome is built from haplotype-specific transcripts, so you need phased genotypes in order to specify the haplotype sequences. There's more detail about phasing in the VCF format in section 1.4.2 of the file specification.

ld9866 commented 1 year ago

example.vcf.zip I'm sorry to bother you again! I have read the recommended document according to what you said, but I am not very clear about how to carry out the follow-up operation of my vcf. Could you please help me ? Best wishes

jeizenga commented 1 year ago

The most common way you'd see it expressed is if the genotypes were separated with a bar (e.g. 0|1) rather than a slash (e.g. 0/1), but it looks like your genotypes all have ploidy of 1. Does that make sense for your organism? It could be that this is an unhandled edge case in vg autoindex.

ld9866 commented 1 year ago

Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that

[IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT

This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,

jeizenga commented 1 year ago

I think the issue is further upstream in this pipeline: you're not getting a phased VCF from your Minigraph-Cactus workflow. I would think that Minigraph-Cactus could (in theory) produce phased VCFs, so I'm not sure why you're not getting one. I would recommend making a support request/issue to the Minigraph-Cactus developers to see if it's possible to get a phased VCF as output. It's also a bit puzzling to me that you're getting haploid genotypes on chromosome 1 in the example you gave me, since that's a diploid chromosome in primates.

ld9866 commented 1 year ago

Ok, thank you for your reply. Wish you all the best! We will ask the developer about this, thanks.

wtulle commented 1 year ago

Hello, sorry for the delay. I phased my vcf using vcf_phase.py (https://ppp.readthedocs.io/en/latest/PPP_pages/Functions/vcf_phase.html). With this phased vcf the autoindex command is working now.

wtulle commented 1 year ago

Now, I'm getting an error running rpvg: rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts

Error: Running rpvg (commit: d0478d077ca882539e653521f0137625ff0da2d4) Random number generator seed: 1678281590 Fragment length distribution parameters found in alignment (mean: 774.566, standard deviation: 19.276) Loaded graph and GBWT (9.37961 seconds, 6.22832 GB) [E::bgzf_read_block] Failed to read BGZF block data at offset 110622063 expected 6159 bytes; hread returned 2687 terminate called after throwing an instance of 'std::runtime_error' what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 7248584834528 group 7242883472583

jonassibbesen commented 1 year ago

Now, I'm getting an error running rpvg: rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts

Error: Running rpvg (commit: d0478d0) Random number generator seed: 1678281590 Fragment length distribution parameters found in alignment (mean: 774.566, standard deviation: 19.276) Loaded graph and GBWT (9.37961 seconds, 6.22832 GB) [E::bgzf_read_block] Failed to read BGZF block data at offset 110622063 expected 6159 bytes; hread returned 2687 terminate called after throwing an instance of 'std::runtime_error' what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 7248584834528 group 7242883472583

Hi, I am not sure why this is happening. Would you be able to share the data? You can send it to jonas.sibbesen@sund.ku.dk Thanks!

jonassibbesen commented 1 year ago

Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that

[IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT

This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,

Hi, see related issue: https://github.com/jonassibbesen/rpvg/issues/46#issuecomment-1460024915

ld9866 commented 1 year ago

Hello, sorry for the delay. I phased my vcf using vcf_phase.py (https://ppp.readthedocs.io/en/latest/PPP_pages/Functions/vcf_phase.html). With this phased vcf the autoindex command is working now.

Thank you for your help!

ld9866 commented 1 year ago

Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that [IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,

Hi, see answer here in a related issue: #46 (comment)

Thank you! l am trying it for your suggestion.