holoviz-topics / EarthSim

Tools for working with and visualizing environmental simulations.
https://earthsim.holoviz.org

Design of arc, node, and polygon drawing #272

Open philippjfr opened 5 years ago

philippjfr commented 5 years ago

This issue attempts to outline the design of the polygon and polyline drawing/editing tools and annotators. The motivation is to figure out how they fit into a workflow for specifying geometries, which are then used to define watersheds from which a mesh can be computed, and how their current limitations could be addressed.

Before beginning we will define some terminology, starting at the highest level and working down:

- **Polygon**: a closed geometry formed by joining one or more arcs.
- **Arc** (or polyline): an open path made up of an ordered sequence of nodes.
- **Feature vertex**: the start or end node of an arc, i.e. the only kind of node that may be shared between arcs.
- **Node**: any other vertex along an arc.

The important thing to note is that each of these geometries needs to afford the ability to be annotated.

In the next section I will outline the current design of the drawing and editing tools and describe the limitations of the current approach.

Current design

Drawing/Editing

In the current design there are two main tools used for drawing and editing polylines and nodes:

- **PolyDraw**: draws new arcs and allows dragging existing ones.
- **PolyEdit**: edits existing arcs by dragging, inserting, and deleting individual vertices.

Annotating

Once the vertices and arcs are drawn, they can be annotated by editing their values in a table. To edit an arc's values, select it in the table and edit them directly. To edit vertex values, the arc must first be selected, which displays its vertices on the plot and populates the vertex table.
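For illustration, here is a minimal sketch of this table-linked annotation pattern using HoloViews' `DataLink`; the column names and values are invented for the example, and this is not the actual EarthSim annotator code:

```python
import holoviews as hv
from holoviews.plotting.links import DataLink

hv.extension('bokeh')

# Hypothetical vertex data with an editable 'weight' annotation
data = {'x': [90, 85, 83], 'y': [90, 87, 81], 'weight': [0.3, 2.4, 1.3]}
points = hv.Points(data, kdims=['x', 'y'], vdims=['weight'])
table = hv.Table(data, kdims=['x', 'y'], vdims=['weight'])

# Share one data source between plot and table so that edits made in
# the table cells are reflected on the plotted vertices
DataLink(points, table)

(points + table).opts(hv.opts.Table(editable=True))
```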

Proposed workflow

  1. Draw a number of arcs using the PolyDraw tool.
  2. Annotate each arc and, if needed, the nodes in each arc.
  3. If needed, split or edit the arcs/nodes using the PolyEdit tool; splitting an arc will clone the shared nodes and promote them to feature vertices, but not link them in any way (see the sketch after this list).
  4. Press a button to join all arcs, or just the selected arcs, into polygons which can then be annotated.
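A minimal sketch of how steps 1 and 3 map onto the existing HoloViews streams (plot size and styling are illustrative):

```python
import holoviews as hv
from holoviews import streams

hv.extension('bokeh')

# An empty Path element that the user draws arcs into (step 1)
arcs = hv.Path([]).opts(width=500, height=500)

# PolyDraw adds new arcs and can drag whole arcs; PolyEdit drags,
# inserts and deletes individual vertices on existing arcs (step 3)
draw = streams.PolyDraw(source=arcs, drag=True, show_vertices=True)
edit = streams.PolyEdit(source=arcs, vertex_style={'color': 'red'})

arcs  # the drawn arc coordinates are then available on draw.data
```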

Limitations

Once an arc has been split, the nodes/feature vertices in each part are completely decoupled: e.g. overlapping nodes or feature vertices which are part of different polylines cannot be dragged at the same time, and their values can diverge when editing the vertex tables. This means this approach is only suited to a linear workflow in which all nodes are annotated before any arcs are split, to prevent their locations and annotated values from diverging.

Positives

Alternative proposals

The main issue with the approach as outlined above is that feature vertices that are part of two or more arcs cannot be edited together. It would therefore be desirable to find a solution that ensures that feature vertices are not decoupled and can be edited together. To make this possible, here are the proposals I could imagine:

Proposal 0 - Accept the current limitations

Not a real proposal, simply an acknowledgement of the current limitations.

Proposal 1 - Add unique ids to vertices

Completely redesign how the drawing and editing tools work by adding a unique id to each vertex. When you start drawing a new arc by snapping to an existing vertex, the vertex is cloned; the same happens when an existing arc is split. Whenever an edit is made to a node, all other arcs are searched for a vertex with the same id and the same edit is applied to it.
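A hypothetical sketch of what this could look like; the dict-based vertex representation and the function names are invented for illustration:

```python
import itertools

_ids = itertools.count()  # global id counter (illustrative only)

def new_vertex(x, y):
    """Create a vertex tagged with a globally unique id."""
    return {'id': next(_ids), 'x': x, 'y': y}

def split_arc(arc, i):
    """Split an arc at vertex i; the cloned vertex keeps the original id."""
    return arc[:i + 1], [dict(arc[i])] + arc[i + 1:]

def propagate_edit(arcs, vertex_id, **updates):
    """Apply an edit to every clone of the vertex, in whichever arc it occurs."""
    for arc in arcs:
        for vertex in arc:
            if vertex['id'] == vertex_id:
                vertex.update(updates)
```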

Pros/Cons

Proposal 2 - Match on coordinate

Instead of establishing identity using a unique id, the same could be achieved by establishing identity using the exact coordinates of a vertex. This would afford all the same capabilities as a unique id but could also be quite expensive to compute.
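Under the same toy representation as the sketch above, coordinate-based matching dispenses with the id but has to scan every vertex of every arc on each edit:

```python
def propagate_edit_by_coord(arcs, x, y, **updates):
    """Apply an edit to every vertex whose coordinates exactly match (x, y)."""
    for arc in arcs:
        for vertex in arc:
            if vertex['x'] == x and vertex['y'] == y:
                vertex.update(updates)
```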

Pros/Cons

Proposal 3 - Multi-node graph representation

The issues in proposals 1 and 2 mostly stem from the format used to represent the arcs: since they are stored as lists of coordinates, arcs that share a node still have to list it separately. @jbednar instead proposes a new format for drawing paths, basically a graph-like representation, where the nodes are held in one data structure and the connectivity between nodes in another, which describes which nodes to connect into an arc. In such a scheme there would be no need to match the identity of nodes, which would be much more scalable.

The format would be something like this:

Node data source

| index | x  | y  | weight |
|-------|----|----|--------|
| 0     | 90 | 90 | 0.3    |
| 1     | 85 | 87 | 2.4    |
| 2     | 83 | 81 | 1.3    |
| ...   |    |    |        |

Arc data source

| arcs      |
|-----------|
| [0, 1, 2] |
| [2, 3, 0] |
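A sketch of how the two data sources might be built in Python; the fourth node's values are made up to complete the second arc:

```python
import pandas as pd

# Node data source: one row per unique node, annotations as extra columns
nodes = pd.DataFrame({
    'x':      [90, 85, 83, 88],
    'y':      [90, 87, 81, 84],
    'weight': [0.3, 2.4, 1.3, 0.9],
})

# Arc data source: each arc is a list of node indices; nodes 0 and 2
# are shared by both arcs without being duplicated
arcs = pd.DataFrame({'arcs': [[0, 1, 2], [2, 3, 0]]})

def arc_coords(arc):
    """Resolve one arc's node indices into an (n, 2) coordinate array."""
    return nodes.loc[arc, ['x', 'y']].to_numpy()
```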

Pros/Cons

jbednar commented 5 years ago

> Would require adding index column to bokeh ColumnDataSource blurring user level with internal implementation details

Hmm; I'm not sure that blurs user-level concerns with implementation, because aren't our users concerned with adding data associated with each node, by index (to ensure each node gets that data only once and unambiguously)? I think the user does care about such an index, at least in many cases.

philippjfr commented 5 years ago

> Hmm; I'm not sure that blurs user-level concerns with implementation, because aren't our users concerned with adding data associated with each node, by index (to ensure each node gets that data only once and unambiguously)?

I don't follow; to associate a value with a node they would edit a cell in a table, so they would never even see the node id.

jbednar commented 5 years ago

What I mean is that the fact that each node has a unique identity is of user-level importance. Sure, the actual implementation as a node id is an implementation detail, but such an implementation is tightly tied to the notion of identity, and perhaps definitional of it, so I don't think it's inappropriate to expose it to the user.

philippjfr commented 5 years ago

@jbednar I've added a proposal 3 to capture your suggestion of a graph-like representation.

philippjfr commented 5 years ago

While I totally understand your concern about the scalability of proposals 1 and 2, I've just done some testing, and what I found is that the bottleneck is the rendering step, not the search. To put it another way, for very large paths or large collections of paths, things will slow down for other reasons long before the node matching becomes an issue.

philippjfr commented 5 years ago

Also, @jlstevens just pointed out that if we only have to check for matches on the feature vertices (i.e. start and end nodes), the matching won't be expensive at all.
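In the toy representation used earlier, that restriction means the search only ever touches two vertices per arc, regardless of arc length:

```python
def match_feature_vertices(arcs, x, y):
    """Find shared feature vertices at (x, y); only an arc's endpoints
    can be shared, so intermediate vertices are never inspected."""
    return [vertex for arc in arcs
                   for vertex in (arc[0], arc[-1])
                   if vertex['x'] == x and vertex['y'] == y]
```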

philippjfr commented 5 years ago

My suggestion would be to go with proposal 2 for the time being and, if that proves problematic due to scalability or for other reasons, tackle proposal 3 as part of the next phase of work.

jbednar commented 5 years ago

But can't there be arbitrarily many items associated with each node, as in millions? It seems like once someone sets up a simulation using such a polygon, there can be a very long time series of data per node, potentially with millions or billions of time steps for a long climate simulation. Ensuring that such data stays consistent between multiple copies of the node just doesn't seem sensible to me. Using a different data structure before and after simulation is of course an option, but seems very ugly. Simply supporting the idea that a node is a unique thing seems so much more straightforward.

philippjfr commented 5 years ago

> It seems like once someone sets up a simulation using such a polygon, there can be a very long time series of data per node, potentially with millions or billions of time steps for a long climate simulation.

Why would anyone ever even consider sending this data to the frontend?

Edit: We are only talking about the data structure for sending the data to the frontend for rendering, editing and annotating the nodes, which can be entirely different from the representation in Python. That said, even in Python I would strongly argue against shoving timeseries into the same data structure used for storing the nodes; linking the two by id is going to be much more efficient. Therefore the data structure sent to the frontend should never contain timeseries; at most it will contain one scalar value per node per dimension.
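A sketch of the separation being argued for here, with illustrative tables: node annotations in one frontend-facing structure, simulation output in a separate one, linked by node id:

```python
import pandas as pd

# Frontend-facing node table: at most one scalar value per node per dimension
nodes = pd.DataFrame({'node_id': [0, 1, 2],
                      'x':       [90, 85, 83],
                      'y':       [90, 87, 81],
                      'weight':  [0.3, 2.4, 1.3]})

# Simulation output lives in a separate table keyed by node_id, so long
# time series never need to be shipped to the frontend for rendering
timeseries = pd.DataFrame({'node_id': [0, 0, 0, 1],
                           'time':    [0, 1, 2, 0],
                           'value':   [1.2, 1.4, 1.1, 0.7]})

# Joining on node_id recovers per-node series whenever they are needed
node0_series = timeseries[timeseries.node_id == 0].merge(nodes, on='node_id')
```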

jbednar commented 5 years ago

But doesn't such linking need an id for each node? It seems like any proposal should include the full life cycle of this data, including any drawing tool version, a regular plotting version with some (limited number of) hover columns, and how long series of data associated with nodes will be stored. Basically, how will this all work in practice? That's the only thing really worth optimizing for.

philippjfr commented 5 years ago

> But doesn't such linking need an id for each node?

That's accurate, but the id can be assigned at any point during or after drawing.

> It seems like any proposal should include the full life cycle of this data, including any drawing tool version, a regular plotting version with some (limited number of) hover columns

I actually think the graph-like storage format is a good idea in the longer term, but I also pretty strongly disagree with this. Trying to make one data structure work for the full lifecycle of the data forces a bunch of design decisions at every level. As I already mentioned, it forces us to write a new bokeh renderer, but beyond that it also introduces a quite non-standard data format to be supported in HoloViews, forcing us to extend the interfaces, and would then also require an entirely new element and plotting class. That's a lot of custom, very specialized code dealing with a format that isn't in common usage and that only has large benefits in certain cases. Indeed, the graph representation is most useful when plotting, because it avoids the node matching issue, but in terms of memory footprint I expect the gains to be very minimal, because very few nodes will ever be shared (only start and end nodes ever are). On the Python end the graph-based format therefore offers little advantage and would have to be converted to other formats such as GEOS (i.e. shapely) geometries for interoperability anyway.

> and how long series of data associated with nodes will be stored

This consideration seems to me entirely separate: the storage of additional information like this is always going to be in a separate table/data structure indexed by id, and I suspect it will be strongly driven by whatever simulator is used to generate the data. Mixing the very different considerations for dealing with these two types of data seems overly prescriptive and therefore a very bad idea to me.

Overall, while I think the graph-based representation would be the most optimal at the drawing stage, I'm not at all convinced it should be replicated for the whole workflow. Conversion between formats will always be relatively cheap and is never going to be the bottleneck, so a little bit of glue code which converts the data appropriately is in my opinion much preferable to being overly prescriptive. Finally, I'd also like to optimize for a solution we can deliver in the short term, while not necessarily precluding a better long-term solution.
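As an example of the kind of glue code meant here, converting the proposal-3 style `nodes`/`arcs` tables sketched above into shapely geometries is only a few lines:

```python
from shapely.geometry import LineString

def arcs_to_linestrings(nodes, arcs):
    """Convert the graph-based format into shapely LineStrings,
    one per arc, by looking up each arc's node coordinates."""
    return [LineString(nodes.loc[idx, ['x', 'y']].to_numpy())
            for idx in arcs['arcs']]
```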

In the short term here are my pros/cons for switching from the current format which would require node matching to the graph based representation (feel free to add more):

jbednar commented 5 years ago

> > It seems like any proposal should include the full life cycle of this data, including any drawing tool version, a regular plotting version with some (limited number of) hover columns
>
> I actually think the graph-like storage format is a good idea in the longer term, but I also pretty strongly disagree with this.

Just to clarify, I am not saying that a single data structure should be used for all stages. I'm saying that all stages should be taken into consideration in the proposals, so that we don't optimize for some local case without anticipating the knock-on effects elsewhere. The simplest assumption is a single data structure to rule them all, but that default position can be supplanted by others when there is good reason.

Also, be careful to separate performance considerations from consistency considerations -- I'm not so worried about the computational cost of synchronizing shared nodes, but about the fact that at the user level, the user can have any number of items associated with nodes, and there needs to be a reasonable way to deal with that without major programming complexity, computational complexity, or memory problems. In certain cases it's fine to duplicate that information, but in the general case it's not, and I want to ensure that the overall solution doesn't require any unrealistic steps like that.

philippjfr commented 5 years ago

> In certain cases it's fine to duplicate that information, but in the general case it's not, and I want to ensure that the overall solution doesn't require any unrealistic steps like that.

I still do not quite follow this point; it is only a major problem if two conditions are met: a) there is a lot of information associated with each node, and b) there are lots of duplicates. The current scheme would only ever associate a few scalar values with each node during the drawing and annotating steps. What happens after that is a totally different issue, and any time series or other data associated with each node should then be stored in a secondary data structure indexed by node id. Secondly, there will only ever be a few duplicates, since we are only allowing feature vertices to be shared.

So again, I agree with you in principle, and if we ever want a better representation for arbitrary meshes these considerations become a lot more important, but given the actual use case we have, this seems like an extremely expensive case of premature optimization to me.

> The simplest assumption is a single data structure to rule them all

This is probably what I disagree with you on most strongly: not only do I see this as horrendously inefficient and just generally a bad idea, I think it also massively increases complexity, because this one format suddenly needs to support everything (scalar values, ragged arrays, efficient storage of timeseries, etc.) and provide convenient ways of querying the data in a useful form.

philippjfr commented 5 years ago

What I really want to know at this point is whether you're on board with making the current implementation work using proposal 1 or 2, which would require only the following steps (most of which I've already prototyped):

And revisit the graph-based format proposal if it proves necessary.

jbednar commented 5 years ago

> The simplest assumption is a single data structure to rule them all

I'm not saying that's the simplest approach -- it's the simplest starting point. It's just a baseline, not a final solution. There can be many reasons a more complex approach is better overall.

philippjfr commented 5 years ago

In that case I suppose I really don't understand the point you're making. To me it seems like by far the most complex starting point as well, since a single data structure that handles all of this will, almost by definition, have the combined complexity of a number of simpler data structures that each handle their specific task well.

jbednar commented 5 years ago

A single list of nodes with arbitrarily many associated columns of data, and a separate data structure tracking connections between those nodes, is not a complex starting point. We should have a live meeting to discuss this rather than going back and forth on a GitHub issue.

philippjfr commented 5 years ago

> A single list of nodes with arbitrarily many associated columns of data, and a separate data structure tracking connections between those nodes, is not a complex starting point

Earlier you were talking about time series as well, hence my confusion. Yes, let's have a meeting.

jlstevens commented 5 years ago

For future reference, I believe https://github.com/ioam/holoviews/issues/3258 is relevant to this discussion.