Detect prohibitively large components and skip (initial) layout/rendering for them

fedarko commented 5 years ago

For a connected component of about 27.5k nodes and about 35k edges, with dot 2.26.0.

This causes collate.py to crash after starting the layout process. ~~Generates a (mostly, as far as I can tell) empty DB file and nothing else -- but it'd be worth messing around with -pg/-px to see what we can get from dot.~~

I'm thinking this might have something to do with just really large files breaking the Subprocess infrastructure, but it could totally also have arisen from the assembly graph file just breaking dot (so if we can get the relevant file via -pg and then crash dot with it, we'll have proof that dot has a hard upper limit here). I guess if that's the case we could compensate for that by either laying out with sfdp/etc or by skipping prohibitively large components, but that's kind of a bummer honestly.

~~Tagging as both types of bug because IDK which this is exactly.~~

Here's the relevant line of code that raises the error I'm getting -- looks like the distro of PyGraphviz in conda might be slightly different from what's on GitHub, so a TODO is to see if getting a newer version installed might fix this (could also try running this on my personal laptop instead of barnacle and seeing if that changes anything).

Once this is fixed, it'd be a good idea to replicate this in a test case.

UPDATE: This problem is demonstrably caused by dot crashing. This implies some sort of upper limit on component size.

Options to compensate for this:

1. Use sfdp or another layout tool that's more performant than dot to lay out large components
2. Just skip laying out large components
3. Only lay out selected regions of large components (similar to how Bandage approaches this)

fedarko commented 5 years ago

Got the relevant file with -pg. Confirmed that it crashes dot 2.26.0 (with a segfault) when dot is given 16gb of memory. Trying with more memory now, but I doubt that'll fix this.

UPDATE: Yeah, tried this with 16gb, 32gb, and 90gb of memory. In each case, dot crashed. This pretty much confirms that we can't find an easy way around this.

fedarko commented 5 years ago

See prior notes in #28. The nslimit/nslimit1 options might be worth checking out here. (If the graph exceeds, say, 10k nodes and 10k edges, then collate could use these options when invoking dot -- we could also let the exact threshold conditions and option values be configurable to the user.)
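A rough sketch of what that threshold check could look like. The function name `dot_command`, the `-Txdot` output format, and the default option values are assumptions for illustration, not what collate.py actually does; `nslimit` and `nslimit1` are real Graphviz graph attributes and `-G` is dot's flag for setting them.

```python
def dot_command(num_nodes, num_edges,
                node_threshold=10000, edge_threshold=10000,
                nslimit=2.0, nslimit1=2.0):
    """Build a dot command line, capping network simplex iterations
    only when the graph exceeds both (hypothetical) thresholds."""
    cmd = ["dot", "-Txdot"]
    if num_nodes > node_threshold and num_edges > edge_threshold:
        # -Gname=value sets a graph attribute from the command line.
        cmd.append("-Gnslimit={}".format(nslimit))
        cmd.append("-Gnslimit1={}".format(nslimit1))
    return cmd


# The 27.5k-node / 35k-edge component would get the extra options;
# a small component would be laid out with dot's defaults.
large = dot_command(27500, 35000)
small = dot_command(100, 200)
```

Keeping the thresholds and option values as keyword arguments is one way to make them user-configurable later.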

fedarko commented 5 years ago

Yeah, I tried the first graph mentioned here with nslimit and nslimit1 both set to 1, as well as mclimit set to 0.1 and splines set to false (as suggested here). dot still crashed, even when given a lot of memory (48GB iirc).

From using dot -v, it looks like dot doesn't even make it through one iteration of the simplex algorithm...

I tried sfdp, and that actually seemed to work here (although I didn't have the triangulation library built, so the layout is probs going to be extra gross). However, at this point I'm pretty sure that trying to render this component will just straight up crash the web interface in most environments, so it is ... probably worth going with option 2 above (just skip displaying it). We should consider letting the user isolate certain regions of the component and lay that out (e.g. if they have some bubbles that are interesting), sort of like how Bandage approaches this -- but for now it's important to have MetagenomeScope not crash the system when it encounters a complex graph, so let's just show the user what we can and then work on improving the UI later.

Also: once we get #35 addressed, not being able to render huge components should be less of an explicit downside (since we can still show some data about them). I think showing parts of the graph is definitely a tractable solution -- we could even compute certain tiny layouts in the browser with e.g. https://gitlab.com/graphviz/graphviz/issues/1275 or https://github.com/mdaines/viz.js/ (similarly to what AGB does), but that's going to definitely require some fancy code to reimplement the python stuff in JS.

fedarko commented 5 years ago

I think that default maxima of 7,999 nodes and 7,999 edges are probably sufficient for most cases for now. Any component with 8k or more nodes, or 8k or more edges, won't be laid out or rendered (both limits are configurable). We can still show some general component-wide stats, and as mentioned above this should pair well with #35.
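For concreteness, the skipping rule amounts to something like the following. This is a minimal sketch: `split_by_size` is a made-up name, and representing each component as a `(num_nodes, num_edges)` tuple is an assumption, not how collate.py stores components.

```python
def split_by_size(components, maxn=7999, maxe=7999):
    """Partition components into (drawable, skipped).

    components is a list of (num_nodes, num_edges) tuples. A component
    is skipped if it has more than maxn nodes OR more than maxe edges.
    """
    drawable, skipped = [], []
    for comp in components:
        num_nodes, num_edges = comp
        if num_nodes > maxn or num_edges > maxe:
            skipped.append(comp)
        else:
            drawable.append(comp)
    return drawable, skipped


drawable, skipped = split_by_size(
    [(27500, 35000), (7999, 7999), (8000, 10), (5, 4)]
)
```

With the defaults, a component with exactly 7,999 nodes and 7,999 edges is still laid out, while one with 8,000 nodes is skipped even if it has very few edges.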

fedarko commented 5 years ago

To copy from ba7278c's commit message, the following is the only outstanding thing left for this issue:

Make sure printing works properly re: -maxn/-maxe and small component messages (cases: all components except small ones are skipped; all components are skipped; only some of the small components are skipped; etc.).

However, it'd also be good to add a few test cases to make sure this works. Some I can think of:

- ensuring that < 1, non-int, etc values of maxn and maxe throw ValueErrors
- ensuring that maxn and maxe work properly (check the resulting .db files to verify)
- if there's a lot of time available for this: check that the printing stuff is done properly (catch stdout and compare it with the expected output). That's not quite as important as testing the other stuff, though.
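The first of those test cases could be sketched roughly as below. `check_limit` and `rejects` are hypothetical names; the real validation in collate.py may be structured differently (e.g. inside argument parsing).

```python
def check_limit(value, name):
    """Return value if it is an int >= 1; otherwise raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("{} must be an integer, got {!r}".format(name, value))
    if value < 1:
        raise ValueError("{} must be at least 1, got {}".format(name, value))
    return value


def rejects(value):
    """Test helper: True if check_limit raises ValueError for value."""
    try:
        check_limit(value, "maxn")
    except ValueError:
        return True
    return False


# Values a test would expect to be rejected for maxn or maxe.
bad_values = [0, -5, 1.5, "10", None, True]
all_rejected = all(rejects(v) for v in bad_values)
```

In an actual test suite the `rejects` helper would probably be replaced by the test framework's own "expect this exception" mechanism.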

fedarko commented 5 years ago

I've noticed a few cases (in metagenome coassemblies) where this still fails (either taking multiple hours for a single component, or just straight-up crashing PyGraphviz -- investigating that one now since I'm guessing it's because dot crashed as before). It might be worth lowering the defaults even further (I'd really prefer to err on the side of being conservative here, since users can increase maxn/maxe if needed), or adding in a "maxr" argument where the user can adjust the maximum allowable edge-to-node ratio (where components with N nodes and E > 1.5N edges won't get rendered -- perhaps this is further dependent on if the component has above a certain number of nodes in the first place, e.g. 1000). Both of these solutions would help make MetagenomeScope more easily usable for people with large datasets.
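A quick sketch of how the "maxr" idea could work. Nothing here exists yet: `too_dense`, `maxr`, and `min_nodes` are hypothetical names, and the 1.5 and 1000 defaults are just the example numbers from above.

```python
def too_dense(num_nodes, num_edges, maxr=1.5, min_nodes=1000):
    """True if a component should be skipped for having too many
    edges relative to nodes.

    The ratio check only applies to components with more than
    min_nodes nodes, so small but dense components are still drawn.
    """
    return num_nodes > min_nodes and num_edges > maxr * num_nodes


skip_a = too_dense(2000, 3001)   # above the 1.5x ratio
skip_b = too_dense(2000, 3000)   # exactly 1.5x, not above it
skip_c = too_dense(500, 5000)    # dense, but too small for the check
```

This would be applied in addition to maxn/maxe, not instead of them.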

fedarko commented 5 years ago

Something also worth testing: when the entire graph consists only of "too large" components. In this case, the preprocessing script should just give a warning/error to the user without producing any DB file.

edit (nov 12, 2020): tested now, sheesh
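For reference, the behavior being tested is roughly the following. This is a sketch under assumptions: `require_drawable` is a made-up name, components are modeled as `(num_nodes, num_edges)` tuples, and the point is only that the check happens before any DB file gets created.

```python
def require_drawable(components, maxn=7999, maxe=7999):
    """Return the components small enough to lay out.

    Raises ValueError if every component exceeds maxn or maxe, so the
    caller can bail out before creating a DB file.
    """
    drawable = [
        (n, e) for (n, e) in components if n <= maxn and e <= maxe
    ]
    if not drawable:
        raise ValueError(
            "All components exceed maxn/maxe; nothing to lay out."
        )
    return drawable


try:
    require_drawable([(27500, 35000), (9000, 12000)])
    all_too_large_raised = False
except ValueError:
    all_too_large_raised = True
```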

marbl / MetagenomeScope

Detect prohibitively large components and skip (initial) layout/rendering for them #137