graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0

segment_matrix improvements (#30) #31

Closed lomereiter closed 4 years ago

lomereiter commented 4 years ago

Note: there are a few changes in the output JSON files (only one dataset made it into the commit, since that's what I test on). This is because the order of the elements in the arrivals and departures arrays is not fixed. To test against the original results, I run jq '.components[].arrivals |= sort_by(.upstream, .downstream) | .components[].departures |= sort_by(.upstream, .downstream)' on each chunk.
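The same order-insensitive comparison can be sketched in Python, for cases where jq is not available. This is only a minimal sketch: the field names (components, arrivals, departures, upstream, downstream) are taken from the jq command above, while the function names normalize_links and chunks_equivalent are hypothetical and not part of the repository.

```python
import json


def normalize_links(chunk):
    """Return a copy of a chunk with arrivals/departures sorted by (upstream, downstream).

    Assumes the layout implied by the jq command: a top-level "components" list
    whose items may carry "arrivals" and "departures" lists of link objects.
    """
    normalized = json.loads(json.dumps(chunk))  # cheap deep copy of JSON-like data
    for component in normalized.get("components", []):
        for field in ("arrivals", "departures"):
            if field in component:
                component[field] = sorted(
                    component[field],
                    key=lambda link: (link["upstream"], link["downstream"]),
                )
    return normalized


def chunks_equivalent(old_chunk, new_chunk):
    """True if two chunks differ only in the order of arrivals/departures."""
    return normalize_links(old_chunk) == normalize_links(new_chunk)
```

Loading the old and new chunk files with json.load and passing them to chunks_equivalent gives a yes/no answer per chunk instead of a textual diff.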

josiahseaman commented 4 years ago

Is this ready for review? I'd like to note that occupants is just a precompute to make the Schematize logic a bit simpler. It could be removed if the corresponding code in Schematize were made smarter (which might also slow the browser). Arrivals and departures, however, are necessary, I believe. They're derived from the order of links listed in the bin, and I don't think it would be simple to remove that requirement.

If you care about JSON size, I have a branch where I changed all the lists to dictionaries or sets, which gives a radically smaller file size: https://github.com/graph-genome/component_segmentation/tree/experimental_v6_sparse_matrix. I ran into issues with this format, particularly on the Schematize side. Now I think it wouldn't be worth it, given all the other JSON format changes. Something to keep in mind, though. If we can precompute things to make the browser faster, that's good. However, if the larger file size makes the file load slowly in the browser, that's counterproductive. I'll leave it to your good judgement.
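To illustrate why replacing lists with dictionaries can shrink the JSON, here is a generic sketch of a dense row versus a sparse mapping. It is not the format used in the experimental_v6_sparse_matrix branch; the helper names to_sparse and to_dense and the sample row are hypothetical.

```python
import json


def to_sparse(row):
    """Keep only non-zero entries, keyed by column index.

    Keys are strings because JSON object keys are always strings.
    """
    return {str(i): value for i, value in enumerate(row) if value}


def to_dense(sparse, length):
    """Rebuild the full list, filling missing columns with 0."""
    return [sparse.get(str(i), 0) for i in range(length)]


# Hypothetical, mostly empty row: 1000 columns with two non-zero entries.
row = [0] * 1000
row[3] = 1
row[500] = 2

dense_json = json.dumps(row)
sparse_json = json.dumps(to_sparse(row))
```

The saving depends entirely on how sparse the rows are, and the consumer has to rebuild or index the sparse form, which is the kind of cost that showed up on the Schematize side.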

josiahseaman commented 4 years ago

Question: since we have data checked into the repo, is it going to generate a diff every time the same command is run?

lomereiter commented 4 years ago

> Question: since we have data checked into the repo, is it going to generate a diff every time the same command is run?

No, it won't. The order now corresponds to the traversal of the sorted dataframe, so no random choices are involved.
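The idea can be sketched without pandas: if the records are sorted before they are traversed, the output no longer depends on the order in which they were produced. This is an illustration only, using plain tuples instead of the dataframe the actual code traverses; build_links and the record layout are hypothetical.

```python
import json
import random


def build_links(records):
    """Group (component, upstream, downstream) records into per-component link lists.

    Sorting first fixes the traversal order, so the result is the same
    regardless of the order of the input records.
    """
    links = {}
    for component, upstream, downstream in sorted(records):
        links.setdefault(component, []).append(
            {"upstream": upstream, "downstream": downstream}
        )
    return links


# Hypothetical records, then the same records in a shuffled order.
records = [(0, 5, 9), (1, 2, 3), (0, 1, 4), (1, 2, 1)]
shuffled = records[:]
random.shuffle(shuffled)

first = json.dumps(build_links(records))
second = json.dumps(build_links(shuffled))
```

Because both runs serialize to the same string, regenerating a checked-in file this way would not produce a diff.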