marschall-lab / gaftools

General purpose utility related to GAF files
https://gaftools.readthedocs.io/
MIT License
9 stars 0 forks source link

Let order_gfa output nodes and edges in an order corresponding to BO and NO tags #16

Closed tobiasmarschall closed 5 months ago

fawaz-dabbaghieh commented 10 months ago

Should be easy to do, a small modification in the write_graph internal GFA function. But just to make sure I understood it correctly, you want to output to follow the BO and NO tag that we start assigning, so for example I take the "first" scaffold node, output it, then output all the nodes in the bubble next to it, then the scaffold node after the bubble, and so on...

Should I mix the S and L lines or do you want the output to first have all the S lines, and then all the L lines? At the moment, I loop through the nodes I want to output into a file, write the S line of that node, and then the L lines associated with that node. Should I keep it as is?

tobiasmarschall commented 10 months ago

Exactly, I thought to output the S lines ordered by their BO (primary sort key) and NO (secondary sort key). In my view, the L lines should be interleaved, i.e. after a S line (or multiple one) corresponding to a given BO tag, you can list of a block of L lines corresponding to all edges within the corresponding biconnected component. In this way, the information stays together. So if write_graph would support this (for graphs that do have BO and NO tags), that'd be a nice feature.

fawaz-dabbaghieh commented 10 months ago

Yeah, I don't think it would be too hard to do. Famous last words!

fawaz-dabbaghieh commented 10 months ago

I just tried to order based on the BO tags, which should give the desired effect already, I looked at the small graph in Bandage, and the GFA output follows the graph from one end to the other. The only caveat here is that the nodes inside the bubble won't be ordered topologically, but all the nodes in the bubble will come after each other in the GFA output. The L lines are intertwined between the S line.

In theory, I can output an index which keeps bubble IDs and offsets in the output GFA file, and you can use that to only load one or more bubbles instead of reading a huge GFA file.

But I'll keep this idea for another day :)

tobiasmarschall commented 10 months ago

Thanks. Yes, let's keep indexing separate. I think both Samarendra and I have some thoughts on indexing as well and would be good to discuss first.

tobiasmarschall commented 10 months ago

Could you also add a test case for this?

tobiasmarschall commented 10 months ago

Doesn't work for me:

$ time gaftools order_gfa --outdir out_d750351 minigraph-extended_all-Oct24.gfa Reading minigraph-extended_all-Oct24.gfa
The graph has:
The graph has 1250314 Nodes and 1802130 Edges
Connected components: 25
Found one connected component per expected chromosome.
Processing chr1
 Input component: 90015 nodes
 Finding Biconnected Components of the component chr1
 It took 4.152787557002739 seconds to find the Biconnected Components
  Bubbles: 17942
  Scaffold graph: 35887 nodes
Traceback (most recent call last):
  File "/home/tobi/miniforge3/envs/gaftools-dev/bin/gaftools", line 8, in <module>
    sys.exit(main())
  File "/home/tobi/scm/gaftools/gaftools/__main__.py", line 88, in main
    module.main(args)
  File "/home/tobi/scm/gaftools/gaftools/cli/order_gfa.py", line 215, in main
    run_order_gfa(**vars(args))
  File "/home/tobi/scm/gaftools/gaftools/cli/order_gfa.py", line 75, in run_order_gfa
    graph.write_gfa(set_of_nodes=component_nodes, output_file=f_gfa, append=False, order_bo=True)
  File "/home/tobi/scm/gaftools/gaftools/GFA.py", line 390, in write_gfa
    set_of_nodes = sorted(list(self.nodes.keys()), key=lambda x: int(self.nodes[x].tags['BO'][1]))
  File "/home/tobi/scm/gaftools/gaftools/GFA.py", line 390, in <lambda>
    set_of_nodes = sorted(list(self.nodes.keys()), key=lambda x: int(self.nodes[x].tags['BO'][1]))
KeyError: 'BO'

real    0m30,692s
user    0m28,821s
sys     0m1,848s

Maybe we should start using feature branches and only merge once there are tests and they are ok.

fawaz-dabbaghieh commented 10 months ago

Yes, this is what I was doing before, developing in a separate branch. But those edits seemed harmless enough, and it passed the all automated tests, so I pushed it to main. I'll fix this later today and get back to you

fawaz-dabbaghieh commented 10 months ago

I fixed the bug you were getting, at least I hope so. I did test it right now for the minigraph-extended and it worked (see time output below). I still need to add proper test cases to the sorting, which i also changed a bit, it should sort better now. Basically sort all the BOs into buckets, and then each bucket of nodes that have the same BO, gets sorted by NO, and I output that order.

I hope it's stable for now, so I don't need to push to master again and can develop things on the side. But wanted to fix this bug to make it usable again

    Command being timed: "gaftools order_gfa minigraph-extended_all_no_seq.gfa --outdir out/"
    User time (seconds): 122.75
    System time (seconds): 2.03
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:04.87
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3640720
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 32
    Minor (reclaiming a frame) page faults: 985690
    Voluntary context switches: 72
    Involuntary context switches: 1222
    Swaps: 0
    File system inputs: 326360
    File system outputs: 491440
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
tobiasmarschall commented 10 months ago

Yes, works for me now. Thanks for the quick fix.