graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0

Split files so Schematize does not crash #4

Closed subwaystation closed 4 years ago

subwaystation commented 4 years ago

@6br reported that Schematize cannot handle large file sizes of, e.g., 4 GB.

One way to circumvent this is to split the whole output of component_segmentation into several files. As scientists want to take a look at a path at a certain position, we will also need a naive index for these files. Therefore, I propose the following:

I am open to any suggestions or alternative solutions @josiahseaman @superjox @ChriKub @6br @ekg . This is what I was able to produce on short notice and will have to be replaced by a real index solution later, though that is not in the scope of our current planning.

6br commented 4 years ago

When we want to visualize a large file, it fails.

Module build failed: RangeError [ERR_FS_FILE_TOO_LARGE]: File size (17769722828) is greater than possible Buffer: 2147483647 bytes
subwaystation commented 4 years ago

@6br to clarify: Did you see this with component_segmentation or with Schematize?

ekg commented 4 years ago

I guess it'd make sense to build a custom index. But an alternative would be to wrangle XG or similar into shape, and wrap that accordingly.

Note that the python bindings we have for odgi could be replicated on top of XG. I actually think there has been some work to do so.

subwaystation commented 4 years ago

So the workflow would be:

  1. odgi build
  2. odgi sort
  3. odgi index
  4. odgi subgraph
  5. odgi bin
  6. component_segmentation
  7. schematize

Since the graph is already linearized by the sorting, is it still accessible via XG the same way as in VG?

Thanks for the tips @ekg. I will explore the possibilities of XG tomorrow.

subwaystation commented 4 years ago

I just realized that with the setup above we have the following problem: binning only a subgraph means that each time we do this on a different subgraph, we get a different binning for overlapping regions. So we could specify the desired bin size in odgi subgraph, and the subgraph would already stick to that binning schema. Let's say I want position 456 of path1 and have a bin size of 10 in mind. The resulting subgraph should be expanded left and right (in pangenome positions) so that the subsequent binning fits within the bin boundaries.
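
To make the alignment concrete, here is a minimal Python sketch of the arithmetic I have in mind, once the pangenome position is known; it assumes 1-based pangenome positions and bins numbered from 1, and the function name is hypothetical:

    def expand_to_bin_boundaries(start, end, bin_size):
        """Expand a 1-based, inclusive pangenome range so it starts and
        ends exactly on boundaries of bins of width bin_size."""
        first_bin = (start - 1) // bin_size + 1   # bin containing `start`
        last_bin = (end - 1) // bin_size + 1      # bin containing `end`
        return (first_bin - 1) * bin_size + 1, last_bin * bin_size

    # e.g. a target at pangenome position 456 with a bin size of 10 falls
    # into bin 46, so the subgraph should cover positions 451-460
    print(expand_to_bin_boundaries(456, 456, 10))   # -> (451, 460)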

6br commented 4 years ago

@6br to clarify: Did you see this with component_segmentation or with Schematize?

Actually, it is due to Schematize. component_segmentation seems to work well. However, I am wondering why the number of paths decreases after the pipeline...

josiahseaman commented 4 years ago

component_segmentation is definitely the correct place to split the files. Then we can split on boundaries that are both bin boundaries and component boundaries. There will be no "half objects" or duplicates. So we'll just generate multiple files as an output.

@6br Paths decreased? I would check for empty paths or non-unique path names. In general I'm not a big fan of using odgi to grab a subgraph, since not all paths will be present in the selected subgraph region.
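
A quick sanity check along those lines could look like this (a hedged sketch; it assumes the path names are available as a plain list):

    from collections import Counter

    def check_paths(path_names):
        """Report duplicated and empty path names, two likely causes of a
        'shrinking' path count after the pipeline."""
        duplicates = [name for name, n in Counter(path_names).items() if n > 1]
        empties = [name for name in path_names if not name.strip()]
        return duplicates, empties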

subwaystation commented 4 years ago

@josiahseaman That is also my concern. However, I am wondering if XG could see a sorted ODGI graph as the matrix we output via odgi bin. We would then tell XG a path + position + bin size and jump to that position in the graph, also looking up the corresponding pangenome position. We could then adjust that position so that the extracted subgraph corresponds to the desired binning given by the bin size. @ekg Do you think that is possible?

Otherwise I have to carefully think about splitting files at the component_segmentation level. There I would have to create a completely new index from scratch, which means a lot of testing, etc.

Extracting the subgraph we require out of an ODGI graph would be the most elegant solution, I would say. I just don't know if that is possible.

josiahseaman commented 4 years ago

Assuming we don't run out of memory in Python, I don't see what the problem would be with splitting files at the JSON output step of component_segmentation. What's the issue? We already have global pangenome coordinates at that point, and the linkColumns are already populated with their pangenome coordinate pairs. Those links can span across files because we maintain the pangenome coordinates. So the first entry in the third file could still be pangenome_position=48291012 and you carry on from there. We can render dangling links without knowing where the link comes down; that's already in Schematize. Done this way, all paths will be output in all files. It's not disk efficient, but it's simple to parse.
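
A rough sketch of that splitting step, assuming each component carries its pangenome first_bin/last_bin (names are placeholders, not the actual component_segmentation API):

    def split_components(components, bins_per_file):
        """Group components into chunks of roughly bins_per_file bins,
        cutting only on component boundaries so nothing is split in half.
        Pangenome coordinates inside each component are left untouched,
        so links may dangle across chunk files."""
        chunks, current, current_bins = [], [], 0
        for comp in components:
            width = comp["last_bin"] - comp["first_bin"] + 1
            if current and current_bins + width > bins_per_file:
                chunks.append(current)          # start a new output file
                current, current_bins = [], 0
            current.append(comp)
            current_bins += width
        if current:
            chunks.append(current)
        return chunks

With this cut rule, a component wider than bins_per_file simply ends up alone in an oversized chunk, which matches the edge case discussed further down.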

subwaystation commented 4 years ago

After a fruitful discussion, thanks to @ekg, we both agreed on the method of "splitting at component_segmentation". @josiahseaman thanks for the feedback again. So we should have covered all the edge cases!

So I will implement it as suggested in https://github.com/graph-genome/component_segmentation/issues/4#issue-563896159, with the addition that I will add a range (from median.nucleotide to median.nucleotide) for each file for each path.

subwaystation commented 4 years ago

So overnight I discovered a huge flaw in my proposed indexing schema. Assuming we have 1000 consecutive bins in each file, the following could happen: in one file a path covers positions 200-500 and 4500-6000, so its range would be 200-6000. However, in another file, the same path could cover positions 1000-2000. If I want to jump to position 1500, both files would be possible...

So I would have to store all consecutive ranges of a path for each file...
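
A sketch of what that per-file index could store, keeping every consecutive run of path positions instead of a single min-max range (data layout and names are assumptions):

    from collections import defaultdict

    def build_path_ranges(file_bins):
        """file_bins: {path_name: list of (start, end) nucleotide ranges,
        one per bin occurrence of that path in this file}.
        Merges touching runs so a jump target can be tested exactly."""
        index = defaultdict(list)
        for path, runs in file_bins.items():
            for start, end in sorted(runs):
                if index[path] and start <= index[path][-1][1] + 1:
                    last_start, last_end = index[path][-1]
                    index[path][-1] = (last_start, max(last_end, end))
                else:
                    index[path].append((start, end))
        return index

    def file_contains(index, path, position):
        return any(start <= position <= end for start, end in index.get(path, []))

With runs stored per file, a jump to position 1500 in the example above matches only the file that actually covers 1000-2000, not the one whose overall span is 200-6000.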

@josiahseaman @ekg @6br @superjox What do you think here?

subwaystation commented 4 years ago

Another edge case that just came to my mind: a component could consist of more than 1000 bins, e.g. many thousands of them. We would have to put that one into a single file. I do not see any other possibility there.

josiahseaman commented 4 years ago

So overnight I discovered a huge flaw in my proposed indexing schema. Assuming we have 1000 consecutive bins in each file, the following could happen: in one file a path covers positions 200-500 and 4500-6000, so its range would be 200-6000. However, in another file, the same path could cover positions 1000-2000. If I want to jump to position 1500, both files would be possible...

I've spent 10 minutes thinking about the problem you posed and trying to understand the difficulty. I think it helps to clearly separate pangenome position from path position in your examples. Finding pangenome position is always trivial: you check the beginning of each file which has first_bin and last_bin in pangenome coordinates. When you find a range that overlaps your target, you load in that full file.

Now finding a path position is pretty tricky; in fact it can't be done perfectly from our JSON, since the JSON loses information compared to the sorted .og files. We had verbally talked about simply brute-forcing the path position in the browser as an easy-to-implement hack. But I had no illusions it would be fast or elegant: starting at the row for that path, read through the entire matrix from beginning to end, check the center_nucleotide, and update the jump_bin_id whenever the center_nucleotide is closer than any previously seen. This means that any GOTO requires a scan of the full file. When we split that up into multiple files, that means loading and scanning all files. Certainly not an elegant solution, but it'll work easily for small files.
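
For clarity, the brute-force GOTO would roughly be the following (written here in Python rather than the Schematize JavaScript, with the field names of our bin matrix assumed):

    def brute_force_goto(path_row, target_nucleotide):
        """Scan every bin of one path's row and return the bin whose
        center_nucleotide is closest to the requested path position."""
        jump_bin_id, best_distance = None, float("inf")
        for cell in path_row:                      # full scan, start to end
            distance = abs(cell["center_nucleotide"] - target_nucleotide)
            if distance < best_distance:
                best_distance, jump_bin_id = distance, cell["bin_id"]
        return jump_bin_id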

This is why as far back as Fukuoka we were planning on depending on an odgi service to look up path -> pangenome position, backed by an .og file or database. There simply isn't an elegant way to do this in JavaScript. It's either a hack, or the real deal as an odgi microservice. For now, I'm happy to split the files so they can be displayed and then look at the odgi CLI for coordinates.

subwaystation commented 4 years ago

Thanks @josiahseaman for laying out your detailed thoughts. I agree that the only reasonable solution would be a service in odgi.

The current plan is to implement a simple version of vg's xg index: a path position index for the bins. The idea is as follows: for each path, initialize a compressed integer vector whose size is the number of nucleotides of the path to index. We then traverse the bins of that path in the correct order. Each time we visit a bin, we know the nucleotide positions that are present in that bin for that specific path, so in our vector we go to each of these nucleotide positions and insert the current bin id there. This is done for all nucleotides of a bin.

Assuming we want to know in which bin position i lies, we just go to the i-th position in the vector, where the bin id is stored. Should be fairly simple. Implementing it will be a nice learning curve for me.
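
In Python pseudocode the idea would look roughly like this; the real implementation would sit inside odgi on a compressed integer vector (e.g. an sdsl int_vector), for which a plain list stands in here, and the input layout is an assumption:

    def build_path_position_index(path_length, bins_of_path):
        """bins_of_path: iterable of (bin_id, [(start, end), ...]) with the
        0-based, end-exclusive nucleotide ranges this path covers in each bin,
        visited in path order."""
        index = [0] * path_length             # 0 = not covered by any bin
        for bin_id, ranges in bins_of_path:
            for start, end in ranges:
                for position in range(start, end):
                    index[position] = bin_id  # every nucleotide points to its bin
        return index

    def bin_for_position(index, i):
        return index[i]                       # O(1): which bin holds position i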

josiahseaman commented 4 years ago

In that case, an even shorter path would be to use vg and just build the adapter code to get our two pieces of software to talk. VG and ODGI are both C++ so I don't see that one is any more accessible than the other.

josiahseaman commented 4 years ago

@superjox If you'd like to pick up where I left off, here's the status:

Things to do:

subwaystation commented 4 years ago

@josiahseaman @superjox It will direct us to the bin identifier to look at. We still need an additional service that will go from bin_id -> file. Then we can read in that file.

josiahseaman commented 4 years ago

You can either output my cut_points array or just peek at the beginning of each file to find the first_bin and last_bin.
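
For example, a small helper that resolves bin_id -> file from such a mapping could look like this (a sketch; the field names match the bin2file.json excerpt shown later in this thread):

    import json

    def file_for_bin(bin2file_path, bin_id):
        """Return the chunk file whose [first_bin, last_bin] range
        contains bin_id, or None if no file covers it."""
        with open(bin2file_path) as handle:
            mapping = json.load(handle)
        for entry in mapping["files"]:
            if entry["first_bin"] <= bin_id <= entry["last_bin"]:
                return entry["file"]
        return None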

joehagmann commented 4 years ago

Things to do:

  • segmentation needs a user parameter or to parse the filename to get the bin_size, or number of nucleotides per bin. Right now these values in the output JSON are a lie.

This should be done, plus a CSV file that maps the file_nr to first_bin_in_first_component_of_file, to be read in by Schematize (branch dev_jh_splitFiles).

josiahseaman commented 4 years ago

I think this issue is mostly done on the python side.

josiahseaman commented 4 years ago

It appears bin2file.json has a serious bug that shows up in large files. Notice the lack of continuity in the chunk numbers:

    "files": [
        {
            "file": "chunk000_bin1000.schematic.json",
            "first_bin": 0,
            "last_bin": 0
        },
        {
            "file": "chunk001_bin1000.schematic.json",
            "first_bin": 1,
            "last_bin": 35991
        },
        {
            "file": "chunk005_bin1000.schematic.json",
            "first_bin": 35992,
            "last_bin": 36009
        },
        {
            "file": "chunk006_bin1000.schematic.json",
            "first_bin": 36010,
            "last_bin": 118859
        },
        {
            "file": "chunk018_bin1000.schematic.json",
            "first_bin": 118860,
            "last_bin": 138596
        },

Where are chunk002-004? Or chunk007-017?

ekg commented 4 years ago

Can the segmentation algorithm be ported to work directly on XG or ODGI?


josiahseaman commented 4 years ago

Offending Code:

        for i, cut in enumerate(cut_points[:-1]):
            end_cut = cut_points[i + 1]
            these_comp = self.components[cut:end_cut]
            if these_comp:  # when we're out of data because the last component is 1 wide
                partitions.append(
                    PangenomeSchematic(JSON_VERSION,
                                       self.bin_width,
                                       these_comp[0].first_bin,
                                       these_comp[-1].last_bin,
                                       these_comp, self.path_names, self.total_nr_files))
                bin2file_mapping.append({"file": self.filename(i),

Given the continuity we see in first_bin, last_bin+1, there must be duplicates in cut_points. This would be caused by many bisect searches all being crammed into the same bins_per_file search inside find_cut_points_in_file_split(). This is a maximum component size problem (#7). So this doesn't cause any loss of data, just big chunk files.
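
A tiny illustration of the failure mode (not the repository code): duplicated cut points make some slices empty, so those chunk numbers are simply skipped while no component is lost:

    cut_points = [0, 10, 10, 10, 25]   # duplicates from crowded bisect searches
    for i, cut in enumerate(cut_points[:-1]):
        these_comp = list(range(cut, cut_points[i + 1]))
        status = "written" if these_comp else "skipped (empty slice)"
        print(f"chunk{i:03d}: {status}")
    # chunk000: written, chunk001: skipped, chunk002: skipped, chunk003: written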

I've added a second step in find_dividers that adds additional dividers for too-big components. Results on Athaliana:

  • without limit: 12 chunks, all even under 310KB; 1393 components
  • cells_per_file // 5 with limit: same as above
  • max_component_size = 5: only 5 chunks, all over 1MB; 1781 components. More content per file.

It seems like splitting up components too finely is throwing off the cells_per_file calculation in the output. Finer estimation:

  • max_component_size = 5: 10 chunks
  • original: same 10 chunks

This seems to be an issue with poor estimation in lazy_average_occupants(), which I have now beefed up to 100 samples.
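
The added step is roughly of this shape (a hedged sketch, not the actual find_dividers code; attribute names are assumptions):

    def add_dividers_for_big_components(dividers, components, max_component_size):
        """After the normal dividers are chosen, insert extra dividers inside
        any component wider than max_component_size bins, so no single chunk
        has to swallow a huge component in one piece."""
        extra = set()
        for comp in components:
            width = comp.last_bin - comp.first_bin + 1
            if width > max_component_size:
                for cut in range(comp.first_bin + max_component_size,
                                 comp.last_bin + 1, max_component_size):
                    extra.add(cut)
        return sorted(set(dividers) | extra)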

josiahseaman commented 4 years ago

@ekg Short answer: no. Longer answer: it's written in Python, so you could base new C++ code on it. Optimistic answer: maybe SWIG voodoo can transpile this to C?

subwaystation commented 4 years ago

@josiahseaman solved this ;)