pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

bed file #561

Closed (ZiliaMR closed 2 weeks ago)

ZiliaMR commented 7 months ago

Hello,

I am conducting an analysis with Helicobacter pylori genomes. I first ran it on a small dataset (n=21) and now have a question about visualizing the results. I want to zoom into a specific region containing genes of interest, and it seems I need a BED or GFF file for this. Are these files obtained from the annotation process? If so, which of the 21 genomes in my dataset should the annotation come from?

Thanks in advance

ekg commented 7 months ago

You can use annotations over any of the genomes you've put into the pangenome graph. The annotations can come from any method that produces them: in silico prediction, RNA-based analyses, or comparative genomics. In odgi, a BED file can be used to collect a subgraph of interest (odgi extract), to guide different processes (like odgi depth), or to affect visualization (odgi draw). Let me know if this helps explain.
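As a rough illustration of how an annotation from one genome could be turned into a BED file, here is a minimal Python sketch that converts GFF3 gene records into BED lines. It relies only on the two formats' coordinate conventions (GFF3 is 1-based and inclusive, BED is 0-based and half-open). The function name, the `path_prefix` parameter, the sample name `strainA#1#`, and the example gene are all hypothetical; the sketch also assumes that the first BED column has to match the name of the corresponding path in the graph, which you should check against your own path names.

```python
def gff_to_bed(gff_lines, path_prefix="", feature_type="gene"):
    """Convert GFF3 feature lines into 4-column BED lines.

    path_prefix is a hypothetical helper: it is prepended to the GFF seqid so
    that the BED sequence name can be made to match a path name in the graph
    (assumption: the graph paths are named "<prefix><contig>").
    """
    bed_lines = []
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and GFF comments/directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF3 record has 9 tab-separated columns; keep only the wanted type.
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        seqid = fields[0]
        start = int(fields[3]) - 1  # GFF start is 1-based -> BED start is 0-based
        end = int(fields[4])        # GFF end is inclusive == BED end exclusive
        # Parse "key=value;key=value" attributes to find a readable name.
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        name = attrs.get("Name") or attrs.get("ID") or "."
        bed_lines.append(f"{path_prefix}{seqid}\t{start}\t{end}\t{name}")
    return bed_lines


if __name__ == "__main__":
    # Made-up example record, for illustration only.
    example = [
        "##gff-version 3",
        "contig_1\tannotator\tgene\t1001\t4500\t.\t+\t.\tID=g1;Name=geneX",
    ]
    for bed_line in gff_to_bed(example, path_prefix="strainA#1#"):
        print(bed_line)
```

The resulting lines could be written to a file and passed to the odgi commands mentioned above; which genome's annotation to use depends on which path you want the coordinates to refer to.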

ZiliaMR commented 7 months ago

Hello,

Thank you very much. I believe I have successfully managed it. However, I am now facing a new challenge.

I am attempting to analyze 1012 genomes of Helicobacter pylori on a server using 1 node (128 GB RAM) with the following command.

pggb -i 1012seqs_hp.fasta.gz -o output_1012 -x auto -n 1012 -p 90 -m

It seems that the analysis was not completed. Here is the message that I obtained.

put_1012/1012seqs_hp.fasta.gz.999a088.mappings.wfmash.paf --invert-filtering
991586.78s user 1694.34s system 1996% cpu 49744.84s total 2476824Kb max memory
[seqwish::seqidx] 0.002 indexing sequences
[seqwish::seqidx] 15.853 index built
[seqwish::alignments] 15.853 processing alignments
[seqwish::alignments] 594.701 indexing
[seqwish::alignments] 14440.858 index built
[seqwish::transclosure] 14440.969 computing transitive closures
[seqwish::transclosure] 14441.373 0.00% 0-10000000 overlap_collect
Command terminated by signal 9
seqwish -s 1012seqs_hp.fasta.gz -p pggb_output_1012/1012seqs_hp.fasta.gz.999a088.alignments.wfmash.paf -k 19 -f 0 -g pggb_output_1012/1012seqs_hp.fasta.gz.999a088.417fcdf.seqwish.gfa -B 10000000 -t 20 --temp-dir pggb_output_1012 -P
22362.05s user 9519.70s system 208% cpu 15274.65s total 125926572Kb max memory

What could be the reason for this? Is it feasible to perform this analysis with the resources and number of strains that I am using?

I would appreciate any insights or suggestions you have regarding this.

Thank you in advance for your help.

subwaystation commented 7 months ago

My first guess would be that you ran out of RAM. seqwish occupied over 125G (125926572Kb max memory in your log) when it was terminated by signal 9, and your node has 128 GB. Add something like --transclose-batch 10000 --resume to your pggb command and you should find out quickly if this was the limiting factor.
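The reasoning above can be checked directly against the log. The following is a minimal Python sketch that pulls the peak memory figures and the signal 9 message out of the timing output and compares the peak with the node's RAM. The function name and the returned keys are hypothetical, and the sketch assumes that "Kb" in the log means units of 1024 bytes and that the node's 128 GB is 128 GiB; if either assumption is wrong the ratio shifts slightly, but not the conclusion that the peak is close to the limit.

```python
import re


def summarize_time_log(log_text, node_ram_gib):
    """Summarize peak memory and SIGKILL evidence from a timing log.

    Assumption: "<N>Kb max memory" reports N units of 1024 bytes.
    """
    # Collect every "<number>Kb max memory" figure reported in the log.
    peaks_kb = [int(n) for n in re.findall(r"(\d+)Kb max memory", log_text)]
    peak_kb = max(peaks_kb) if peaks_kb else 0
    peak_gib = peak_kb / 1024**2  # KiB -> GiB
    return {
        "peak_gib": peak_gib,
        # Signal 9 is SIGKILL, which is what an out-of-memory kill looks like.
        "killed_by_signal_9": "terminated by signal 9" in log_text,
        "fraction_of_ram": peak_gib / node_ram_gib,
    }


if __name__ == "__main__":
    # Lines copied from the log posted in this thread.
    log = (
        "991586.78s user 1694.34s system 1996% cpu 49744.84s total 2476824Kb max memory\n"
        "Command terminated by signal 9\n"
        "22362.05s user 9519.70s system 208% cpu 15274.65s total 125926572Kb max memory\n"
    )
    summary = summarize_time_log(log, node_ram_gib=128)
    print(f"peak: {summary['peak_gib']:.1f} GiB")
    print(f"killed by signal 9: {summary['killed_by_signal_9']}")
    print(f"fraction of node RAM: {summary['fraction_of_ram']:.2f}")
```

A peak this close to the node's total memory, together with a signal 9 termination, is consistent with the out-of-memory explanation, though it does not prove it on its own.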