vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

vg index #4104

Open Tonitsk8264 opened 1 year ago

Tonitsk8264 commented 1 year ago

I split vcf, fasta and gtf according to chr, and followed the steps of https://github.com/jonassibbesen/vgrna-project-paper/tree/main/scripts/bash/contruct_pantranscriptome https://github.com/vgteam/vg/wiki/Index-Construction#complex-graph-with-many-paths to build the index step by step, specific command line The parameters are as follows.

CPU=4
OUT_PREFIX=pantranscriptome
# 1.construct_graphs
vg construct -p -t ${CPU} -a -v ${CHR}/${CHR}.vcf.gz -r ${CHR}/${CHR}.fasta
vg convert -p ${CHR}/${CHR}.vg > ${CHR}/${CHR}.pg
vg rna -p -t ${CPU} -e -n ${CHR}/${CHR}.gtf ${CHR}/${CHR}.pg

# 2.join_graphs
vg ids -j $(for i in $(seq 1 9; echo UN); do echo chr$i/chr$i.spliced.pg; done) -m empty_node_mapping.map

# 3.project_transcripts
vg index -p -t ${CPU} -G ${CHR}/haplotypes.gbwt -v ${CHR}/${CHR}.vcf.gz ${CHR}/${CHR}.spliced.pg
vg rna -p -t ${CPU} -o -r -n ${CHR}/${CHR}.gtf -l ${CHR}/haplotypes.gbwt -b ${CHR}/${OUT_PREFIX}.gbwt -f ${CHR}/${OUT_PREFIX}.fa -i ${CHR}/${OUT_PREFIX}.txt ${CHR}/${CHR}.spliced.pg > ${CHR}/${OUT_PREFIX}_tmp.pg && mv ${CHR}/${OUT_PREFIX}_tmp.pg ${CHR}/${OUT_PREFIX}.pg

# 4.generate_xg
vg index -p -t 4 -b ./tmp/ -x ${OUT_PREFIX}.xg $(for i in $(seq 1 9; echo UN); do echo chr${i}/"${OUT_PREFIX}".pg; done)

# 5.merge
vg gbwt -p -m -o ${OUT_PREFIX}.gbwt $(for i in $(seq 1 9; echo UN); do echo chr${i}/${OUT_PREFIX}.gbwt; done) 
cat $(for i in $(seq 1 9; echo UN); do echo chr${i}/${OUT_PREFIX}.txt; done) | grep -v ^Name > ${OUT_PREFIX}.txt
cat $(for i in $(seq 1 9; echo UN); do echo chr${i}/${OUT_PREFIX}.fa; done) > ${OUT_PREFIX}.fa

# 6.generate_gcsa
mv empty_node_mapping.map mapping
for i in $(seq 1 9; echo UN); do vg prune -p -t ${CPU} -u -a -m mapping chr${i}/${OUT_PREFIX}.pg > chr${i}/${OUT_PREFIX}_pruned.vg  ; done
vg index -p -t ${CPU} -b ./tmp/ -g ${OUT_PREFIX}.gcsa -f mapping $(for i in $(seq 1 9; echo UN); do echo chr${i}/${OUT_PREFIX}_pruned.vg; done)

# Generate distance index
vg index -p -t ${CPU} -j ${OUT_PREFIX}.dist ${OUT_PREFIX}.xg

vg mpmap -t 16 -x pantranscriptome.xg -g pantranscriptome.gcsa -d pantranscriptome.dist -f 2L-1/1.R1.fastq.gz -f 2L-1/1.R2.fastq.gz > 2L-1.gamp

rpvg --graph pantranscriptome.xg --paths pantranscriptome.gbwt --alignments 2L-1.gamp --output-prefix rpvg --inference-model haplotype-transcripts --path-info pantranscriptome.txt --threads 4

However, after using the vg mpmap command, an error still occurs when using rpvg

rpvg: /psd/container/micromamba-0.25.1/envs/VG_v1.49.0/share/rpvg/src/main.cpp:317: spp::sparse_hash_map<std::__cxx11::basic_string<char>, PathInfo> parseHaplotypeTranscriptInfo(const string&, bool, bool): Assertion `haplotype_transcript_info_it.first->second.source_ids.emplace(haplotype_id_index_it.first->second).second' failed.

Looking forward to your response.

jeizenga commented 1 year ago

Which of these commands is the one that led to the crash? The warning suggests you may have started the pipeline with one version of VG and then switched at some point. Is that the case?