When we want to visualize a large file, it fails.
Module build failed: RangeError [ERR_FS_FILE_TOO_LARGE]: File size (17769722828) is greater than possible Buffer: 2147483647 bytes
@6br to clarify: Did you see this with component_segmentation or with Schematize?
I guess it'd make sense to build a custom index. But an alternative would be to wrangle XG or similar into shape, and wrap that accordingly.
Note that the python bindings we have for odgi could be replicated on top of XG. I actually think there has been some work to do so.
So the workflow would be:
As the graph is already linearized for the sorting, is it still accessible to XG the same way as in VG?
Thanks for the tips @ekg. I will explore the possibilities of XG tomorrow.
I just realized with the setup above we have the following problem:
Binning only a subgraph means each time we do this on a different subgraph, we get a different binning for overlapping regions.
So we can specify in odgi subgraph the desired bin size, and the subgraph will already stick to the binning schema. Let's say I want position 456 of path1 and I have a bin size of 10 in mind. The resulting subgraph should be expanded left and right (on the pangenome positions) so that the subsequent binning fits within the bin size.
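The expansion described above can be sketched in a few lines, assuming for simplicity that we are already working in pangenome coordinates (the helper name is made up, this is not part of odgi):

```python
def bin_aligned_range(position, bin_size):
    """Expand a single 1-based position to the boundaries of the bin that
    contains it, so a subsequent binning pass with the same bin size lines
    up exactly with the global binning schema."""
    bin_id = (position - 1) // bin_size  # 0-based bin index
    start = bin_id * bin_size + 1        # first position in that bin
    end = (bin_id + 1) * bin_size        # last position in that bin
    return start, end

# Position 456 with bin size 10 falls in the bin spanning 451..460.
print(bin_aligned_range(456, 10))  # (451, 460)
```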
Actually it was due to Schematize; component_segmentation seems to work well. However, I am wondering why the number of paths decreases after the pipeline.
component_segmentation is definitely the correct place to split the files. Then we can split on boundaries that are both bin boundaries and component boundaries. There will be no "half objects" or duplicates. So we'll just generate multiple files as an output.
@6br Paths decreased? I would check for empty paths or non-unique path names. In general I'm not a big fan of using odgi to grab a subgraph, since not all paths will be present in the subgraph selected region.
@josiahseaman That is also my concern. However, I am wondering if XG could see a sorted ODGI as the matrix we output via odgi bin. Then we would tell XG a path + position + bin size. We would then jump to that position in the graph, but also take a look at the corresponding pangenome position. We can then adjust that position so that the extracted subgraph corresponds to our desired binning given by the bin size. @ekg Do you think that is possible?
Else I would have to carefully think about splitting files on the component_segmentation level. There I would have to create a completely new index from scratch, which means a lot of testing, etc.
Extracting the subgraph we require out of an ODGI graph would be the most elegant solution, I would say. I just don't know if that is possible.
Assuming we don't run out of memory in Python, I don't see what the problem would be with splitting files at the JSON output step of component_segmentation. What's the issue? We already have global pangenome coordinates at that point and the linkColumns are already populated with their pangenome coordinate pairs. Those links can span across files because we maintain the pangenome coordinates. So the first entry in the third file could still be pangenome_position=48291012 and you carry on from there. We can render dangling links without knowing where the link comes down; that's already in Schematize. Done this way, all paths will be output in all files. It's not disk efficient, but it's simple to parse.
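The splitting step described above could be sketched roughly as follows. This is not the actual component_segmentation code, just an illustration of cutting only at component boundaries while keeping global pangenome coordinates:

```python
def split_components(components, bins_per_file):
    """Split a list of (first_bin, last_bin) component ranges into files,
    cutting only at component boundaries so no component is halved.
    Because global bin coordinates are kept, links can span files."""
    files, current, bins_in_current = [], [], 0
    for comp in components:
        first_bin, last_bin = comp
        width = last_bin - first_bin + 1
        # Flush the current file if adding this component would overflow it.
        if current and bins_in_current + width > bins_per_file:
            files.append(current)
            current, bins_in_current = [], 0
        current.append(comp)
        bins_in_current += width
    if current:
        files.append(current)
    return files

comps = [(1, 400), (401, 900), (901, 1500), (1501, 1600)]
for f in split_components(comps, 1000):
    # Each file records its global first_bin/last_bin for later lookup.
    print({"first_bin": f[0][0], "last_bin": f[-1][1]})
```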
After a fruitful discussion thanks to @ekg, we both agreed to the method "splitting at component_segmentation".
@josiahseaman thanks for the feedback again. So we should have covered all the edge cases!
So I will implement this as suggested in https://github.com/graph-genome/component_segmentation/issues/4#issue-563896159, with the addition that I will add a range (from median.nucleotide to median.nucleotide) for each file for each path.
So overnight I discovered a huge flaw in my proposed indexing schema. Assuming we have 1000 consecutive bins in each file, the following could happen: for a path we have positions 200-500 and positions 4500-6000, so the stored range would be 200-6000. However, in another file, we could have positions 1000-2000. If I want to jump to position 1500, both files would be possible.
So I would have to store all consecutive ranges of a path for each file.
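A minimal sketch of that fix, with made-up names: instead of one (min, max) range per path per file, store every consecutive run of positions and check the target against those runs.

```python
def build_ranges(positions):
    """Collapse a sorted list of path positions into consecutive ranges."""
    ranges = []
    for p in positions:
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], p)  # extend the current run
        else:
            ranges.append((p, p))            # start a new run
    return ranges

def files_for_position(file_ranges, target):
    """Return only the files whose stored runs actually contain target."""
    return [f for f, ranges in file_ranges.items()
            if any(lo <= target <= hi for lo, hi in ranges)]

# The example from above: a naive (min, max) range for file_a would be
# 200-6000 and would wrongly match position 1500; the runs do not.
file_ranges = {
    "file_a": build_ranges(list(range(200, 501)) + list(range(4500, 6001))),
    "file_b": build_ranges(list(range(1000, 2001))),
}
print(files_for_position(file_ranges, 1500))  # ['file_b']
```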
@josiahseaman @ekg @6br @superjox What do you think here?
Another edge case that just came to my mind: a component could consist of more than 1000 bins, e.g. thousands of them. We would have to put that one into a single file; I do not see any other possibility there.
I've spent 10 minutes thinking about the problem you posed and trying to understand the difficulty. I think it helps to clearly separate pangenome position from path position in your examples. Finding pangenome position is always trivial: you check the beginning of each file which has first_bin and last_bin in pangenome coordinates. When you find a range that overlaps your target, you load in that full file.
Now finding a path position is pretty tricky; in fact it can't be done perfectly from our JSON, since it is a loss of information compared to the sorted .og files. We had verbally talked about simply brute-forcing the path position in the browser as an easy-to-implement hack. But I had no illusions it would be fast or elegant: starting at the row for that path, read through the entire matrix from beginning to end and check the center_nucleotide, then update the jump_bin_id if the center_nucleotide is closer than any previously seen. This means that any GOTO requires a scan of the full file. When we split that up into multiple files, that means loading and scanning all files. Certainly not an elegant solution, but it'll work easily for small files.
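The brute-force scan described above would look something like this (shown in Python rather than browser JavaScript; the row layout is an assumption mirroring the JSON field names):

```python
def brute_force_jump_bin(matrix_row, target_position):
    """Linear scan of one path's row: visit every bin cell and keep the
    bin whose center nucleotide is closest to the target path position.
    matrix_row is a list of (bin_id, center_nucleotide) pairs."""
    jump_bin_id, best_distance = None, float("inf")
    for bin_id, center_nucleotide in matrix_row:
        distance = abs(center_nucleotide - target_position)
        if distance < best_distance:
            jump_bin_id, best_distance = bin_id, distance
    return jump_bin_id

row = [(1, 50), (2, 150), (3, 250), (4, 2050)]
print(brute_force_jump_bin(row, 240))  # 3: center 250 is closest to 240
```

Every GOTO is O(row length), which is exactly why this only works for small files.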
This is why as far back as Fukuoka we were planning on depending on an odgi service to look up path -> pangenome position backed by an .og file or database. There simply isn't an elegant way to do this in javascript. It's either a hack, or the real deal as an odgi microservice. For now, I'm happy to split the files so they can be displayed and then look at odgi CLI for coordinates.
Thanks @josiahseaman for laying out your detailed thoughts. I agree that the only reasonable solution would be a service in odgi.
The current plan is to implement a simple version of vg's xg index. We will have a path position index for the bins. The idea is as follows: For each path: Initialize a compressed integer vector with the size of the number of nucleotides of the path to index. We then traverse the bins of that path in the correct order. Each time we visit a bin, we know the nucleotide positions that are present in that bin for that specific path. So in our vector, we go to the nucleotide position and insert the current bin id there. This is done for all nucleotides of a bin.
Assuming, we want to know, in which bin position i is, we just go to the i-th position in the vector where the bin id is stored. Should be fairly simple. Implementing it will be a nice learning curve for me.
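A plain-Python sketch of the index described above; a real implementation would use a compressed integer vector (e.g. from sdsl-lite) rather than a Python list, and the input layout here is an assumption:

```python
def build_path_index(path_length, bins):
    """bins: iterable of (bin_id, [nucleotide_positions]) in path order.
    Returns a vector where slot i-1 holds the bin id of 1-based position i."""
    index = [0] * path_length  # 0 = position not covered by any bin
    for bin_id, positions in bins:
        for pos in positions:      # 1-based nucleotide positions of this bin
            index[pos - 1] = bin_id
    return index

def bin_of_position(index, i):
    """O(1) lookup: which bin contains 1-based path position i?"""
    return index[i - 1]

idx = build_path_index(10, [(7, [1, 2, 3]), (9, [4, 5, 6, 7]), (12, [8, 9, 10])])
print(bin_of_position(idx, 5))  # 9
```

Memory is linear in the path length, which is why a compressed representation matters for real genomes.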
In that case, an even shorter path would be to use vg and just build the adapter code to get our two pieces of software to talk. VG and ODGI are both C++ so I don't see that one is any more accessible than the other.
@superjox If you'd like to pick up where I left off here's the status:
Things to do:
- Output is split into .schematic01, .schematic02, etc. This will go through the filename in ViewportInputs.js that I just wrote Friday. @josiahseaman @superjox It will direct us to the bin identifier to look at. We still need an additional service that will go from bin_id -> file. Then we can read in that file.
You can either output my cut_points array or just peek the beginning of each file to find the first_bin and last_bin.
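The "peek the beginning of each file" option could look roughly like this; the mapping layout follows the first_bin/last_bin entries described above, and the helper name is made up:

```python
import bisect

def find_file_for_bin(mapping, bin_id):
    """mapping: list of {"file", "first_bin", "last_bin"} dicts sorted by
    first_bin. Binary-search the first_bin column, then confirm the target
    bin really falls inside that file's range."""
    starts = [entry["first_bin"] for entry in mapping]
    i = bisect.bisect_right(starts, bin_id) - 1
    if i >= 0 and mapping[i]["first_bin"] <= bin_id <= mapping[i]["last_bin"]:
        return mapping[i]["file"]
    return None  # bin id outside every file

mapping = [
    {"file": "chunk000.schematic.json", "first_bin": 1, "last_bin": 1000},
    {"file": "chunk001.schematic.json", "first_bin": 1001, "last_bin": 2000},
]
print(find_file_for_bin(mapping, 1500))  # chunk001.schematic.json
```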
Things to do:
- segmentation needs a user parameter or to parse the filename to get the bin_size, or number of nucleotides per bin. Right now these values in the output JSON are a lie.
Should be done, plus a CSV file that maps the file_nr to first_bin_in_first_component_of_file, to be read in by Schematize (branch dev_jh_splitFiles).
I think this issue is mostly done on the python side.
It appears bin2file.json has a serious bug that shows up with large files. Notice the lack of continuity:
"files": [
{
"file": "chunk000_bin1000.schematic.json",
"first_bin": 0,
"last_bin": 0
},
{
"file": "chunk001_bin1000.schematic.json",
"first_bin": 1,
"last_bin": 35991
},
{
"file": "chunk005_bin1000.schematic.json",
"first_bin": 35992,
"last_bin": 36009
},
{
"file": "chunk006_bin1000.schematic.json",
"first_bin": 36010,
"last_bin": 118859
},
{
"file": "chunk018_bin1000.schematic.json",
"first_bin": 118860,
"last_bin": 138596
},
Where are chunk002-004? Or chunk007-017?
Can the segmentation algorithm be ported to work directly on XG or ODGI?
On Fri, Feb 28, 2020 at 4:48 PM Josiah Seaman wrote:
Reopened #4 https://github.com/graph-genome/component_segmentation/issues/4.
Offending Code:
for i, cut in enumerate(cut_points[:-1]):
    end_cut = cut_points[i + 1]
    these_comp = self.components[cut:end_cut]
    if these_comp:  # when we're out of data because the last component is 1 wide
        partitions.append(
            PangenomeSchematic(JSON_VERSION,
                               self.bin_width,
                               these_comp[0].first_bin,
                               these_comp[-1].last_bin,
                               these_comp, self.path_names, self.total_nr_files))
        bin2file_mapping.append({"file": self.filename(i),
                                 "first_bin": these_comp[0].first_bin,
                                 "last_bin": these_comp[-1].last_bin})
Given the continuity we see in first_bin, last_bin+1, there must be duplicates in cut_points. These would be caused by many bisect searches all being crammed into the same bins_per_file search inside of find_cut_points_in_file_split(). This is a maximum component size problem (#7). So this doesn't cause any loss of data, just big chunk files.
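A minimal reproduction of the numbering gap, assuming the loop shape shown above: duplicate cut points yield empty slices, so those file indices are skipped while i keeps counting.

```python
components = list(range(100))
# Several bisect results crammed into the same spot -> duplicate cuts.
cut_points = [0, 10, 10, 10, 10, 50, 100]

written = []
for i, cut in enumerate(cut_points[:-1]):
    these_comp = components[cut:cut_points[i + 1]]
    if these_comp:        # empty slice -> chunk i is never written
        written.append(i)
print(written)  # [0, 4, 5] -- chunks 1-3 are missing, like chunk002-004 above

# Deduplicating before slicing keeps the numbering contiguous:
unique_cuts = sorted(set(cut_points))
```

No components are lost either way; they all land in the chunk after the duplicated cut, which is exactly the "just big chunk files" symptom.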
I've added a second step in find_dividers that adds additional dividers for components that are too big.
Athaliana without limit: 12 chunks, all even under 310KB. 1393 components
cells_per_file // 5 with limit: same as above
max_component_size = 5: only 5 chunks, all over 1MB. 1781 components. More content per file.
It seems like splitting up components too finely is throwing off the cells_per_file calculation in output.
Finer estimation:
max_component_size = 5: 10 chunks
original: same 10 chunks
This seems to be an issue with poor estimation in lazy_average_occupants(), which I have now beefed up to 100 samples.
@ekg short answer: no. Longer answer: it's written in Python, so you could base new C++ code on it. Optimistic answer: maybe SWIG voodoo can transpile this to C?
@josiahseaman solved this ;)
@6br reported that Schematize can not handle large file sizes of e.g. 4GB.
One way to circumvent this is to split the whole output of component_segmentation into several files. As scientists want to take a look at a path at a certain position, we will also need a naive index for these files. Therefore, I propose the following naming schema: graph_name.bin_start.component_start.#components.json. Concrete example:
path1: [graph1.1.1.56: 1]
A binary search on such a list should be fast enough to find the file(s) of choice. Note that the searched data might be at the end of a file, so we would have to load two of them in Schematize. I am open to any suggestions or alternative solutions @josiahseaman @superjox @ChriKub @6br @ekg. This is what I was able to produce on short notice; it will have to be replaced by a real index solution later, though that is not in the scope of our current planning.