jonassibbesen / rpvg

Method for inferring path posterior probabilities and abundances from pangenome graph read alignments
MIT License
47 stars 6 forks source link

Unable to run rpvg analyses without error #53

Open JoshLangman opened 1 year ago

JoshLangman commented 1 year ago

I was attempting to perform analyses using rpvg on the spliced graph, pantranscriptome, and alignment produced with vg however shortly after running rpvg it crashed. The minimal files and pipeline used to reproduce this error are given below, though the same error occurs when using the full sized files.

error:

rpvg: /home/rpvg/src/fragment_length_dist.cpp:23: FragmentLengthDist::FragmentLengthDist(double, double, double, uint32_t): Assertion `isValid()' failed.
Aborted (core dumped)

gff file genomic_chr8.gff

##gff-version 3
NC_050103.1 RefSeq  region  1   182411202   .   +   .   ID=NC_050103.1:1..182411202;Dbxref=taxon:4577;Name=8;chromosome=8;cultivar=B73;gbkey=Src;genome=chromosome;mol_type=genomic DNA
NC_050103.1 BestRefSeq  gene    263720  266280  .   +   .   ID=gene-LOC100273199;Dbxref=GeneID:100273199;Name=LOC100273199;description=uncharacterized LOC100273199;gbkey=Gene;gene=LOC100273199;gene_biotype=protein_coding;gene_synonym=cl25415_1a,GRMZM6G040576
NC_050103.1 BestRefSeq  mRNA    263720  266280  .   +   .   ID=rna-NM_001147643.1;Parent=gene-LOC100273199;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;Name=NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq  exon    263720  264201  .   +   .   ID=exon-NM_001147643.1-1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq  exon    264910  265104  .   +   .   ID=exon-NM_001147643.1-2;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq  exon    265204  265439  .   +   .   ID=exon-NM_001147643.1-3;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq  exon    265540  266280  .   +   .   ID=exon-NM_001147643.1-4;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NM_001147643.1;gbkey=mRNA;gene=LOC100273199;product=uncharacterized LOC100273199;transcript_id=NM_001147643.1
NC_050103.1 BestRefSeq  CDS 264022  264201  .   +   0   ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1 BestRefSeq  CDS 264910  265104  .   +   0   ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1
NC_050103.1 BestRefSeq  CDS 265204  265439  .   +   0   ID=cds-NP_001141115.1;Parent=rna-NM_001147643.1;Dbxref=GeneID:100273199,GenBank:NP_001141115.1;Name=NP_001141115.1;gbkey=CDS;gene=LOC100273199;product=uncharacterized protein LOC100273199;protein_id=NP_001141115.1

vcf file chr8_3lines_renamed.vcf:

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##contig=<ID=NC_050103.1>
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Mo17    P39 B97
NC_050103.1 363813  8-17988 C   G   .   PASS    DP=0;AC=2;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0
NC_050103.1 364616  8-18791 C   T   .   PASS    DP=0;AC=0;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0
NC_050103.1 364621  8-18796 C   G   .   PASS    DP=0;AC=4;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0
NC_050103.1 364622  8-18797 T   C   .   PASS    DP=0;AC=4;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0
NC_050103.1 364624  8-18799 A   C   .   PASS    DP=0;AC=4;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0
NC_050103.1 364629  8-18804 C   T   .   PASS    DP=0;AC=0;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0
NC_050103.1 364734  8-18909 C   G   .   PASS    DP=0;AC=0;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0
NC_050103.1 364782  8-18957 C   T   .   PASS    DP=0;AC=2;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0
NC_050103.1 364901  8-19076 G   T   .   PASS    DP=0;AC=2;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0  1/1:0,0:0:33:0,0,0
NC_050103.1 364944  8-19119 T   A   .   PASS    DP=0;AC=0;AN=6  GT:AD:DP:GQ:PL  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0  0/0:0,0:0:33:0,0,0

fastq file SRR5911103.4reads.fastq:

@SRR5911103.1 1 length=90
CAAATTTGCATGGNTATCTGTTATTCCTTTTTANGNGTAANGNCTTNNAANANAATGTAATNNNANNNNNAAANNNNANNNNNNNNNNNN
+SRR5911103.1 1 length=90
AAAAAEEEEEEEE#EEEEEEEEEEEEEEEEEEE#E#EEEE#E#EEE##/E#/#AAEEEEEE###E#####6EA####/############
@SRR5911103.2 2 length=90
TCATATCGGTAGGTTGTGGTATTTCATTGCTACAAACATGGGTTATNGTANAATAAGACATGNNANNNNGATACNNNTCNNNNNNNNNNA
+SRR5911103.2 2 length=90
6AAAAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE/E#EEA#AEEEEE/EEEE##E####EEEEE###EA##########/
@SRR5911103.3 3 length=90
ATTGATGCTGTGAGATGCATGTGTGTCTTTTGTTTCACGTTGCATTNCATAGGCAAGTCGAGATGNNNNGTTGGNNNTGTNCNNNANNNT
+SRR5911103.3 3 length=90
AAAAAEEAE6EEAEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEE#EAE/EEEEEEEEEEE6EE####EEEEE###EAE#A###/###E
@SRR5911103.4 4 length=90
TTGGGGTTTGGAGTAGTTCTCATATAGATCTTTATATTCAGCCCCTNTGGGAAGATACATTTCACNNNNAAATCNNNTGANTNNNANNNT
+SRR5911103.4 4 length=90
AAAAAEEEEEEEEEEEEEEEEEEEEEE6EEEEEAEEEEEEEAEEEE#EEEEEEEAEEEEEEEEAE####EEAEE###EEE#E###A###/

yaml file environment.yml for the conda environment:

name: pantranscriptome
channels:
  - conda-forge
  - bioconda
dependencies:
  - vg =1.49
  - cmake >=3.10
  - protobuf =3
  - htslib =1.17
  - jansson =2.14
  - llvm-openmp
  - bcftools =1.17
  - sra-tools =3
  - samtools =1.17

command line pipeline:

$ bgzip chr8_3lines_renamed.vcf
$ tabix chr8_3lines_renamed.vcf.gz
$ curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_902167145.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_902167145.1.zip" -H "Accept: application/zip"
$ unzip GCF_902167145.1.zip
$ mv ncbi_dataset/data/GCF_902167145.1/GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna .
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna
$ samtools faidx GCF_902167145.1_Zm-B73-REFERENCE-NAM-5.0_genomic.fna NC_050103.1 > genomic_chr8.fa
$ samtools faidx genomic_chr8.fa
$ vg construct -p -r genomic_chr8.fa -v chr8_3lines_renamed.vcf.gz > 282_rpvg.vg
$ vg convert -p 282_rpvg.vg > 282_rpvg.pg
$ vg rna -p -s Parent -t 12 -n genomic_chr8.gff -b pantranscriptome.gbwt -i pantranscriptome.txt 282_rpvg.pg > 282_rpvg.spliced.pg
$ vg convert --xg-out 282_rpvg.spliced.pg > 282_rpvg.spliced.xg
$ vg index --threads 4 --dist-name 282_rpvg.dist 282_rpvg.spliced.xg
$ vg prune --progress --threads 4 --unfold-paths --gbwt-name pantranscriptome.gbwt  --append-mapping --mapping mapping 282_rpvg.spliced.pg > 282_rpvg.spliced.pruned.vg
$ vg index --progress --threads 4 --gcsa-out 282_rpvg.gcsa --mapping mapping 282_rpvg.spliced.pruned.vg 
$ vg mpmap --nt-type RNA --threads 4 --graph-name 282_rpvg.spliced.xg --gcsa-name 282_rpvg.gcsa --dist-name 282_rpvg.dist --fastq SRR5911103.4reads.fastq > SRR5911103.gamp
$ vg gbwt --num-threads 4 --r-index pantranscriptome.gbwt.ri pantranscriptome.gbwt
$ wget https://github.com/jonassibbesen/rpvg/releases/download/v1.0/rpvg-v1.0_linux.tar.gz
$ tar zxvf rpvg-v1.0_linux.tar.gz
$ rpvg-v1.0_linux/bin/rpvg --threads 8 --single-end --frag-mean 90 --frag-sd 0 --graph 282_rpvg.spliced.xg --paths pantranscriptome.gbwt -f pantranscriptome.txt --alignments Mo17/SRR5911103.gamp -o rpvg_SRR5911103 --inference-model haplotype-transcripts

Your help in identifying the cause of this error would be greatly appreciated!

jonassibbesen commented 1 year ago

Hi, thank you for writing. The standard deviation for the fragment length distribution (--frag-sd) needs to be above 0. Changing this should resolve the error. Please let me know if you run into any other issues.

twrightsman commented 1 year ago

Hi @jonassibbesen, I'm working with @JoshLangman on this; thank you for the quick response.

In this example we are trying to quantify a single-end 3' RNA-seq library that has 90bp reads (no variance in read length). We have no information regarding the true fragment distribution of the library, unfortunately.

What would you recommend in this case? Perhaps the -l option that removes the effective length normalization is the right path, though I'm not sure if that assumes long reads instead of the short reads we have.