Closed tobiasmarschall closed 5 months ago
Exactly, I thought to output the S
lines ordered by their BO
(primary sort key) and NO
(secondary sort key). In my view, the L
lines should be interleaved, i.e. after a S
line (or multiple one) corresponding to a given BO
tag, you can list of a block of L
lines corresponding to all edges within the corresponding biconnected component. In this way, the information stays together. So if write_graph
would support this (for graphs that do have BO
and NO
tags), that'd be a nice feature.
Yeah, I don't think it would be too hard to do. Famous last words!
I just tried to order based on the BO tags, which should give the desired effect already, I looked at the small graph in Bandage, and the GFA output follows the graph from one end to the other. The only caveat here is that the nodes inside the bubble won't be ordered topologically, but all the nodes in the bubble will come after each other in the GFA output. The L lines are intertwined between the S line.
In theory, I can output an index which keeps bubble IDs and offsets in the output GFA file, and you can use that to only load one or more bubbles instead of reading a huge GFA file.
But I'll keep this idea for another day :)
Thanks. Yes, let's keep indexing separate. I think both Samarendra and I have some thoughts on indexing as well and would be good to discuss first.
Could you also add a test case for this?
Doesn't work for me:
$ time gaftools order_gfa --outdir out_d750351 minigraph-extended_all-Oct24.gfa Reading minigraph-extended_all-Oct24.gfa
The graph has:
The graph has 1250314 Nodes and 1802130 Edges
Connected components: 25
Found one connected component per expected chromosome.
Processing chr1
Input component: 90015 nodes
Finding Biconnected Components of the component chr1
It took 4.152787557002739 seconds to find the Biconnected Components
Bubbles: 17942
Scaffold graph: 35887 nodes
Traceback (most recent call last):
File "/home/tobi/miniforge3/envs/gaftools-dev/bin/gaftools", line 8, in <module>
sys.exit(main())
File "/home/tobi/scm/gaftools/gaftools/__main__.py", line 88, in main
module.main(args)
File "/home/tobi/scm/gaftools/gaftools/cli/order_gfa.py", line 215, in main
run_order_gfa(**vars(args))
File "/home/tobi/scm/gaftools/gaftools/cli/order_gfa.py", line 75, in run_order_gfa
graph.write_gfa(set_of_nodes=component_nodes, output_file=f_gfa, append=False, order_bo=True)
File "/home/tobi/scm/gaftools/gaftools/GFA.py", line 390, in write_gfa
set_of_nodes = sorted(list(self.nodes.keys()), key=lambda x: int(self.nodes[x].tags['BO'][1]))
File "/home/tobi/scm/gaftools/gaftools/GFA.py", line 390, in <lambda>
set_of_nodes = sorted(list(self.nodes.keys()), key=lambda x: int(self.nodes[x].tags['BO'][1]))
KeyError: 'BO'
real 0m30,692s
user 0m28,821s
sys 0m1,848s
Maybe we should start using feature branches and only merge once there are tests and they are ok.
Yes, this is what I was doing before, developing in a separate branch. But those edits seemed harmless enough, and it passed the all automated tests, so I pushed it to main. I'll fix this later today and get back to you
I fixed the bug you were getting, at least I hope so. I did test it right now for the minigraph-extended
and it worked (see time
output below). I still need to add proper test cases to the sorting, which i also changed a bit, it should sort better now. Basically sort all the BOs into buckets, and then each bucket of nodes that have the same BO, gets sorted by NO, and I output that order.
I hope it's stable for now, so I don't need to push to master again and can develop things on the side. But wanted to fix this bug to make it usable again
Command being timed: "gaftools order_gfa minigraph-extended_all_no_seq.gfa --outdir out/"
User time (seconds): 122.75
System time (seconds): 2.03
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:04.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3640720
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 32
Minor (reclaiming a frame) page faults: 985690
Voluntary context switches: 72
Involuntary context switches: 1222
Swaps: 0
File system inputs: 326360
File system outputs: 491440
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Yes, works for me now. Thanks for the quick fix.
Should be easy to do, a small modification in the
write_graph
internal GFA function. But just to make sure I understood it correctly, you want to output to follow the BO and NO tag that we start assigning, so for example I take the "first" scaffold node, output it, then output all the nodes in the bubble next to it, then the scaffold node after the bubble, and so on...Should I mix the S and L lines or do you want the output to first have all the S lines, and then all the L lines? At the moment, I loop through the nodes I want to output into a file, write the S line of that node, and then the L lines associated with that node. Should I keep it as is?