Open wtulle opened 1 year ago
Hi, from the error it looks like the pantranscriptome does not match the graph that was given as input (ref.xg). Could you share the command lines you used when running vg rna and vg mpmap? Thanks!
Hi, here are the full commands I used:
bgzip -c contigs.vcf > contigs.vcf.gz && tabix contigs.vcf.gz vg construct -r ref.fa -v contigs.vcf.gz -a > ref.vg vg index -x ref.xg -T ref.vg vg autoindex --workflow mpmap -t 2 --prefix vg_rna --ref-fasta ref.fa --vcf contigs.vcf.gz --tx-gff kortha.gtf
vg mpmap -t 2 -x vg_rna.spliced.xg -g vg_rna.spliced.gcsa -d vg_rna.spliced.dist -f r1.fq.gz -f r2.fq.gz > mpmap.gamp vg rna -n kortha.gtf ref.vg -i pantranscriptome.txt -b pantranscriptome.gbwt > pantranscriptome.txt.gz rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts
Thanks!
To co-generate the pantranscriptome and the spliced graph/indexes, you should use vg autoindex --workflow mpmap --workflow rpvg
. That should guarantee that all of the indexes are consistent with each other. In general, I would recommend against mixing indexes from the automatic vg autoindex
pipeline and manual pipelines, which could use different construction parameters.
Sorry to bother you again, I'm trying your answer but now I'm getting another error I can't resolve: vg autoindex --workflow mpmap --workflow rpvg --prefix vg_rna --ref-fasta ref.fa --vcf contigs.vcf.gz --tx-gff kortha.gff
The output: [IndexRegistry]: Checking for phasing in VCF(s). error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA VCF are insufficient to create target index Haplotype-Transcript GBWT
I've tried adding the parameter --gfa ref.gfa with similar result:
[IndexRegistry]: Checking for phasing in VCF(s). [IndexRegistry]: Provided: VCF [IndexRegistry]: Checking for haplotype lines in GFA. [IndexRegistry]: Provided: Reference GFA error:[vg autoindex] Input is not sufficient to create indexes
The issue is that vg autoindex
isn't finding phased variants in the VCF, which are required to form the haplotype-specific transcripts in the pantranscriptome. Is your VCF un-phased?
The issue is that
vg autoindex
isn't finding phased variants in the VCF, which are required to form the haplotype-specific transcripts in the pantranscriptome. Is your VCF un-phased? Hello! Have you solved the problem? I had the same problem
Have you already checked for phasing in your VCF?
Hello! I would like to ask how to "phasing in your VCF", do you mean that this vcf needs to use bcftools for quality control?
VCFs can express either phased or unphased genotypes. Phased genotypes link together the alleles at multiple loci as occurring on the same haplotype. Unphased genotypes simply assert the alleles at each locus without specifying what combination of alleles co-occur on each haplotype. The pantranscriptome is built from haplotype-specific transcripts, so you need phased genotypes in order to specify the haplotype sequences. There's more detail about phasing in the VCF format in section 1.4.2 of the file specification.
example.vcf.zip I'm sorry to bother you again! I have read the recommended document according to what you said, but I am not very clear about how to carry out the follow-up operation of my vcf. Could you please help me ? Best wishes
The most common way you'd see it expressed is if the genotypes were separated with a bar (e.g. 0|1
) rather than a slash (e.g. 0/1
), but it looks like your genotypes all have ploidy of 1. Does that make sense for your organism? It could be that this is an unhandled edge case in vg autoindex
.
Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that
[IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT
This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,
I think the issue is further upstream in this pipeline: you're not getting a phased VCF from your Minigraph-Cactus workflow. I would think that Minigraph-Cactus could (in theory) produce phased VCFs, so I'm not sure why you're not getting one. I would recommend making a support request/issue to the Minigraph-Cactus developers to see if it's possible to get a phased VCF as output. It's also a bit puzzling to me that you're getting haploid genotypes on chromosome 1 in the example you gave me, since that's a diploid chromosome in primates.
Ok, thank you for your reply. Wish you all the best! We will ask the developer about this, thanks.
Hello, sorry for the delay. I phased my vcf using vcf_phase.py (https://ppp.readthedocs.io/en/latest/PPP_pages/Functions/vcf_phase.html). With this phased vcf the autoindex command is working now.
Now, I'm getting an error running rpvg: rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts
Error: Running rpvg (commit: d0478d077ca882539e653521f0137625ff0da2d4) Random number generator seed: 1678281590 Fragment length distribution parameters found in alignment (mean: 774.566, standard deviation: 19.276) Loaded graph and GBWT (9.37961 seconds, 6.22832 GB) [E::bgzf_read_block] Failed to read BGZF block data at offset 110622063 expected 6159 bytes; hread returned 2687 terminate called after throwing an instance of 'std::runtime_error' what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 7248584834528 group 7242883472583
Now, I'm getting an error running rpvg: rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -f pantranscriptome.txt -a mpmap.gamp -o rpvg --inference-model haplotype-transcripts
Error: Running rpvg (commit: d0478d0) Random number generator seed: 1678281590 Fragment length distribution parameters found in alignment (mean: 774.566, standard deviation: 19.276) Loaded graph and GBWT (9.37961 seconds, 6.22832 GB) [E::bgzf_read_block] Failed to read BGZF block data at offset 110622063 expected 6159 bytes; hread returned 2687 terminate called after throwing an instance of 'std::runtime_error' what(): [vg::io::MessageIterator] obsolete, invalid, or corrupt input at message 7248584834528 group 7242883472583
Hi, I am not sure why this is happening. Would you be able to share the data? You can send it to jonas.sibbesen@sund.ku.dk Thanks!
Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that
[IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT
This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,
Hi, see related issue: https://github.com/jonassibbesen/rpvg/issues/46#issuecomment-1460024915
Hello, sorry for the delay. I phased my vcf using vcf_phase.py (https://ppp.readthedocs.io/en/latest/PPP_pages/Functions/vcf_phase.html). With this phased vcf the autoindex command is working now.
Thank you for your help!
Thank for your reply! First, l used the minigraph-cactus to establish a pan-genome which result in the primates-pg.vcf.gz primates-pg.gfa.gz. Secondly, l used the code "vg autoindex --workflow mpmap -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" which works ok and vg mpmap works well. However, l found that l can not have the pantranscriptome.gbwt and pantranscriptome.txt.gz in your example file. l found that you said the "vg autoindex --workflow mpmap --workflow rpvg" and l print the code "vg autoindex --workflow mpmap --workflow rpvg -t 20 --prefix vg_rna_gfa2 --ref-fasta ref.fa --vcf primates-pg.vcf.gz --tx-gff ref.gtf" but it showed that [IndexRegistry]: Checking for haplotype lines in GFA. error:[vg autoindex] Input is not sufficient to create indexes Inputs GTF/GFF Reference FASTA Reference GFA w/ Haplotypes are insufficient to create target index Haplotype-Transcript GBWT This is exactly the crux of my problem, and I find it very difficult to understand, again to bother you. Best yours,
Hi, see answer here in a related issue: #46 (comment)
Thank you! l am trying it for your suggestion.
Hi, I need help with an error running rpvg command. I'm getting "paths_index.cpp:73: uint32_t PathsIndex::nodeLength(uint32_t) const: Assertion `hasNodeId(node_id)' failed" error, but I have no idea what that means. The full command is: rpvg -t 2 -g ref.xg -p pantranscriptome.gbwt -a mpmap.gamp -o rpvg --inference-model haplotypes-transcripts
Any idea of what the problem could come from, the .xg, .gbwt or .gamp parameter? Thanks!