pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

Sorting graph with `-Y` option triggers "Assertion `idx < this->size()' failed" error #548

Open sivico26 opened 9 months ago

sivico26 commented 9 months ago

Dear odgi team,

Thanks for developing odgi. I am working on a huge graph, so each processing step takes a long time. I was pruning some empty nodes from my graph so that I could later explore it with some of my tools. When I optimized the node space after the pruning, I hit the following error:

odgi: /opt/conda/conda-bld/odgi_1687621144080/work/build/sdsl-lite-prefix/src/sdsl-lite-build/include/sdsl/int_vector.hpp:1360: sdsl::int_vector<<anonymous> >::reference sdsl::int_vector<<anonymous> >::operator[](const size_type&) [with unsigned char t_width = 1; reference = sdsl::int_vector_reference<sdsl::int_vector<1> >; size_type = long unsigned int]: Assertion `idx < this->size()' failed.
/var/spool/pbs/mom_priv/jobs/19616399.meta-pbs.metacentrum.cz.SC: line 56: 131458 Aborted                 odgi sort -t $threads -i odgi_pruned.og -p Ygs -O -o og_opt_transfer.og

Looking at previous issues, I found that #430 was a lengthy, relevant discussion. In the end, I adjusted my command and removed the -Y from odgi sort (everything else equal), and it worked. So I can continue with my analyses.

However, this is somewhat unsatisfactory, since I cannot do the PG-SGD sort with my graph. If I understood the discussion correctly, the possible reasons listed there do not apply to this case, since I pruned the graph without trouble. To be precise, this is the command I used:

odgi prune -t $threads -TEc 1 -i $graph -o odgi_pruned.og ## $graph is in .gfa format

Correct me if I am wrong, but this indicates that odgi build does not have any trouble with my graph, which should discard many of the possible problems (e.g. W lines). Furthermore, my input graph for odgi sort was written by odgi prune. Thus, I wonder what could be causing the assertion error.

My graphs are big (before pruning: .gfa ~118 GB and .og ~245 GB; after pruning: .gfa ~112 GB and .og ~179 GB), so they are not easily shareable, but sharing may be possible if needed. I can help to check or run commands on them if instructed.

We are missing something around this problem. I wanted to report what I found and continue the discussion.

Let me know what you think.

P.S.: Another minor issue: why does odgi prune require -E for -c to work? That does not make sense to me. If I remove some nodes, it follows that I want to get rid of the associated edges as well. The current behavior is that if you specify only -c, odgi seems to reason that, since no edges are being removed, it cannot leave edges behind without their associated nodes, so it does not prune the nodes that match the criteria (and the output graph is identical to the input). To me, this is not sensible behavior. Why is it like that? I am probably missing something.

subwaystation commented 9 months ago

@sivico26 Could you please share both graphs? You can drop a mail to simon.heumos@qbic.uni-tuebingen.de. At first glance, I would try odgi sort -O first, without the PG-SGD step. Then I would do odgi sort -Ygs. Also, did you try vg convert to obtain a GFAv1 file compatible with ODGI? Or how did you generate the graph?
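
The two-step order suggested above can be sketched as a small command builder. This is only an illustration: it assumes that "odgi sort -Ygs" refers to the `-p Ygs` pipeline from the original command, it reuses only flags that already appear in this thread (`-t`, `-i`, `-o`, `-O`, `-p`), and the intermediate and output file names are hypothetical.

```python
import shlex
from typing import List


def two_step_sort_commands(graph: str, optimized: str, sorted_out: str,
                           threads: int) -> List[List[str]]:
    """Build the two odgi sort invocations: optimize first, then PG-SGD.

    Assumption: '-p Ygs' is what 'odgi sort -Ygs' means in the comment above.
    """
    # Step 1: optimize the node ID space only (no PG-SGD step).
    optimize = ["odgi", "sort", "-t", str(threads),
                "-i", graph, "-O", "-o", optimized]
    # Step 2: run the Ygs pipeline on the already optimized graph.
    pg_sgd = ["odgi", "sort", "-t", str(threads),
              "-i", optimized, "-p", "Ygs", "-o", sorted_out]
    return [optimize, pg_sgd]


# Hypothetical file names, printed as shell commands for inspection.
for cmd in two_step_sort_commands("odgi_pruned.og", "og_opt.og",
                                  "og_opt_sorted.og", 8):
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` so that a failure in the first step stops the second one from running on a missing or partial file.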

sivico26 commented 9 months ago

Hi @subwaystation, thanks for the quick reply.

I am loading the graphs to our filesystem to see if I can send them that way.

The graphs come from using cactus and its progressive algorithm (it is a super-pangenome actually), which generates a .hal, then I used hal2vg and then vg convert to get the first .gfa. I then post-processed that graph with smoothxg and gfaffix. I made the mistake of not turning off the generation of the consensus paths when using smoothxg, so I need to prune those from the graph. I used odgi to remove the paths successfully, but then that left the 0 coverage nodes (that used to be crossed by consensus paths but not by any other paths), and now I am trying to remove those too.

I hope that helps.

sivico26 commented 9 months ago

@subwaystation,

In theory, a link to download the graph should be in your mail. Let me know if it works.

odgi sort -O should work (it already did for me). Since I added -p gs too, the difference maker is -Y.

subwaystation commented 8 months ago

I downloaded your graph; I need to run your commands next.

subwaystation commented 8 months ago

@sivico26 Using the most recent master of ODGI v0.8.4-2-g1e12685c, I was not even able to complete the odgi build step:

/usr/bin/time --verbose odgi build -g og_opt_transfer.gfa -o og_opt_transfer.og -t 28 -P
[odgi::gfa_to_handle] building nodes: 100.00% @ 1.46e+06 bp/s elapsed: 00:00:10:54 remain: 00:00:00:00
[odgi::gfa_to_handle] building edges: 100.00% @ 1.52e+06 bp/s elapsed: 00:00:14:38 remain: 00:00:00:00
[odgi::gfa_to_handle] building paths: 13.64% @ 3.53e-02 bp/s elapsed: 00:00:05:39 remain: 00:00:35:50
[odgi::gfa_to_handle] id parsing failure for path Hbul.Hbul_1_chr6H attempting to parse node id from ''
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoull
Command terminated by signal 6
        Command being timed: "odgi build -g og_opt_transfer.gfa -o og_opt_transfer.og -t 28 -P"
        User time (seconds): 3582.92
        System time (seconds): 643.94
        Percent of CPU this job got: 178%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 39:31.37
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 364934480
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 127634161
        Voluntary context switches: 152059032
        Involuntary context switches: 2896258
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Not sure whether the file is corrupt or does not conform to the GFA spec. Which version of ODGI were you using?

sivico26 commented 8 months ago

Hi @subwaystation,

That's strange. odgi prune uses odgi build under the hood when the input is .gfa, right? If that is the case, it worked for me indirectly. The version I am using is v0.8.3-26-gbc7742ed, installed through conda.

Do you think it is related to #549? This is the same pruned graph I am referring to. It indeed deviates from GFA specs.

subwaystation commented 8 months ago

I was expecting the raw, unpruned graph. But you already sent me the pruned one?

sivico26 commented 8 months ago

I realized that in your log, odgi build failed while parsing the path Hbul.Hbul_1_chr6H. Following the commands described in #549, I can confirm this is the first path affected by the trailing comma (`,`). So it is very likely that this is the problem.
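
To illustrate why a trailing comma would produce the "attempting to parse node id from ''" message, here is a minimal sketch that scans GFA P lines for an empty step. It assumes the GFAv1 layout of a P line (tab-separated, with the comma-separated step list in the third field); the function name and the sample lines are made up for illustration and are not part of odgi.

```python
from typing import Iterable, Iterator, Tuple


def paths_with_empty_steps(gfa_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (path_name, step_index) for each P line containing an empty step."""
    for line in gfa_lines:
        if not line.startswith("P\t"):
            continue  # only path lines are of interest
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed P line without a step list
        name, steps = fields[1], fields[2].split(",")
        for i, step in enumerate(steps):
            if step == "":
                # A trailing comma leaves an empty last token, which a
                # node-id parser cannot convert to a number.
                yield name, i
                break


# Hypothetical miniature graph: the second path ends with a stray comma.
sample = [
    "S\t1\tACGT\n",
    "S\t2\tTT\n",
    "P\tgood_path\t1+,2+\t*\n",
    "P\tbad_path\t1+,2+,\t*\n",
]
print(list(paths_with_empty_steps(sample)))  # [('bad_path', 2)]
```

Because the function is a generator over lines, it can be fed an open file handle, and `next(paths_with_empty_steps(fh), None)` stops at the first affected path without reading the rest of a very large GFA.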

In that case, running something like:

sed -E "s|,\t\*|\t\*|" og_opt_transfer.gfa > new_og_opt_transfer.gfa

should do the trick.
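
For comparison, the same repair can be sketched in Python. This is an illustrative alternative to the sed one-liner, not something taken from odgi: the function name is hypothetical, and unlike the sed expression (which only matches a comma directly before a `*` overlap field) it strips every trailing comma from the step field of any P line, whatever the overlap field contains.

```python
def strip_trailing_commas(line: str) -> str:
    """Remove trailing commas from the step list (third field) of a GFA P line."""
    if not line.startswith("P\t"):
        return line  # leave S, L, H and other lines untouched
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 3:
        fields[2] = fields[2].rstrip(",")
    return "\t".join(fields) + newline


# Hypothetical example lines.
print(repr(strip_trailing_commas("P\tbad_path\t1+,2+,\t*\n")))
print(repr(strip_trailing_commas("S\t1\tACGT\n")))
```

For a graph of this size, the function would be applied line by line while streaming from the input file to a new output file, so that the whole GFA never has to be held in memory.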

sivico26 commented 8 months ago

This is indeed the pruned graph. Sorry if it was not the desired one. I can send you the one before pruning. Should I proceed?

subwaystation commented 8 months ago

Please do so! Thanks :) This should also help @AndreaGuarracino to better understand the odgi prune problem. And we can find out whether odgi prune is actually the culprit here.

subwaystation commented 8 months ago

> Hi @subwaystation,
>
> That's strange. odgi prune uses odgi build under the hood when the input is .gfa, right? If that is the case, it worked for me indirectly. The version I am using is v0.8.3-26-gbc7742ed, installed through conda.
>
> Do you think it is related to #549? This is the same pruned graph I am referring to. It indeed deviates from GFA specs.

While odgi prune does run odgi build before pruning, it seems that the graph written after the pruning step is what causes the problems.

sivico26 commented 8 months ago

Yes, what is strange is that odgi prune (or odgi view) writes problematic P lines after the pruning.