See this anglr issue and the replicated unjoin issue for some general thoughts, in addition to which comes this one:
Using 5-byte uid values is fine and good, but if silicate is one day to be the super-efficient beast it is destined to be, then these will likely be better replaced by simple incremental integers, so that adjacent entities can be compactly represented as varint-class objects in delta form, meaning most of them will simply be stored as 1, requiring 1 bit rather than 5 bytes. My current sphier-type vision is to pre-label vertices, edges, paths, arcs, and objects with respective first letters ("v", "e", ...), but otherwise to use this incremental integer scheme.
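A minimal sketch of the delta idea in base R (the varint encoding itself is out of scope here; the point is just that incremental IDs collapse to runs of 1):
uid <- 1:10                     # incremental integer IDs
deltas <- c(uid[1], diff(uid))  # delta form: first value, then successive differences
deltas                          # 1 1 1 ... : ideal input for a varint coding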
Further co-issue for self: work out where the ordering of the key_col integers returned from unjoin::unjoin comes from. These appear to be pretty randomly ordered, and thus ultimately not amenable to varint::delta representation. Fix?
That ordering comes from factor and from smashing values together as character:
unjoin(mtcars, drat, wt)
as.integer(factor(paste(mtcars$drat, mtcars$wt)))
[1] 21 22 20 8 11 1 12 16 23 24 24 7 5 6 3 4 13 26 31 28 17 2 10 18 9 25 30 19 29 15 14 27
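A possible fix, assuming order of first appearance is what we want: match against unique() instead of relying on factor's alphabetical level ordering:
key <- paste(mtcars$drat, mtcars$wt)
match(key, unique(key))  # codes in order of first appearance, so successive deltas stay small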
Watch this space https://github.com/tidyverse/dplyr/issues/1792
I'm finding my stance on this has moved somewhat, partly because I just really need this package on CRAN so I can get on with stuff.
Default to structural IDs, basically the silicore model. Anything that needs to maintain relational IDs can add them. It simplifies this package somewhat and removes the need to include subsetting methods.
I still think this is right. I tried to go again on the SC0 model but got stuck - a full example of normalizing vertices and then normalizing segments is here:
https://github.com/hypertidy/contourPolys#other-attempts
In SC0 (silicore) I cheat by building edges from grouped paths in structural indexing, but we really need edges from normalized indexing as in that readme, because juggling back and forth is very hard to understand.
The key is de-ordering segments so that they are identifiable: I use pmin/pmax, but surely there's a faster way to do this - and maybe even a vector type that records the edge as given but can provide undirected identity as needed? (Is complex a useful model?)
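A minimal sketch of the pmin/pmax trick, with made-up vertex indices: the same undirected edge traversed in opposite directions gets the same identity once each pair is ordered low-high.
.vx0 <- c(1L, 2L, 2L)      # segment start vertices
.vx1 <- c(2L, 3L, 1L)      # segment end vertices (third segment is 1-2 reversed)
lo <- pmin(.vx0, .vx1)     # smaller index of each pair
hi <- pmax(.vx0, .vx1)     # larger index of each pair
duplicated(paste(lo, hi))  # FALSE FALSE TRUE: segments 1 and 3 are the same undirected edge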
I think I've got a fair insight into how to do this with C++ std::unordered_map objects, but the polygon nesting / holes thing is maybe not so simple. I'll give it a go in the context of contourPolys/issues/4
I've been messing around with removing heavy indexes, and it's only just occurred to me that a simple binary type can have two tables, object and vertex, with a list-col for edge:
compact_indexes_properly <- function(x) {
  x <- compact_indexes(x)
  ## gather the edge indexes belonging to each object
  ind <- split(x$object_link_edge$edge_,
               x$object_link_edge$object_)
  ## nest each object's edges as a list-col on the object table
  x$object$edge <- lapply(ind, function(xx) x$edge[xx, ])
  ## the link table and the global edge table are now redundant
  x$object_link_edge <- NULL
  x$edge <- NULL
  x
}
## still using heavy and inefficient SC to demonstrate the difference
library(silicate)
compact_indexes_properly(SC(minimal_mesh))
$object
# A tibble: 2 x 2
a edge
* <int> <list>
1 1 <tibble [12 × 2]>
2 2 <tibble [4 × 2]>
$vertex
# A tibble: 14 x 2
x_ y_
<dbl> <dbl>
1 0 0
2 0 1
3 0.75 1
4 1 0.8
5 0.5 0.7
6 0.8 0.6
7 0.69 0
So, that's the right core format for SC afaics, and with an edge-recyclerizer we could take it a long way. One thing it loses over current SC is that there's no explicit mapping between instances of edges and their unordered counterparts, but I think edge normalization and ordering is something we need to be able to do on the fly anyway.
For all the stuff I did with PATH that relied on UIDs, I think that can be done on the fly, especially if just done between pairs of tables rather than always in the whole model.
(I think you were telling me about this two-table model but I wasn't ready to go the whole way.)
It's going to take a fair bit to refactor silicate around this, but it also seems to give a potential invalidation capability: because the edges are originally ordered around the paths, a simple index could record their grouping within paths. If you do something really "edge-based", then you simply remove that path-validity. So maybe PATH can just totally die and become an invalidate-able subclass of SC, re-cyclerizing can restore those pieces, and ARC is another special type of PATH.
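A tiny sketch of that grouping index, with illustrative column names (this is not current silicate code):
edge <- data.frame(.vx0  = c(1L, 2L, 3L),
                   .vx1  = c(2L, 3L, 1L),
                   path_ = c(1L, 1L, 1L))  # which path each edge belongs to, in order
edge$path_ <- NULL  # an edge-based operation invalidates PATH by dropping the grouping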
That looks really promising! I've been a bit hammered here, but will defo take it fer a spin by end of next week
So during my inaction I have actually had some insightful thoughts here, also following on from #61. I reckon this only needs an extra argument for SC, inherited by all functions called therein. Something like this should suffice:
SC <- function (x, id = NULL, ...) {
    if (!is.null (id)) {
        uids <- x [id]  # use the nominated column (x ["id"] here looked like a bug)
    } else {
        idcol <- grep ("id", names (x), ignore.case = TRUE)
        if (length (idcol) == 1) {
            uids <- x [idcol]
        } else {
            if (length (idcol) > 1)
                message ("Can not unambiguously determine ID column; going to just make some IDs up")
            ## uids <- [... present code via `ids`]
        }
    }
    ## ... then construct SC from x, carrying uids through ...
}
Whaddayareckon?
This seems fine in general, everything is currently modelled off not having existing names so it's definitely open to improvement.
I guess OSM in sf is a funny case, because the inherent names aren't meaningful in the sf context, and we'd be better going direct to OSM sources as you say in the other issue.
There's still the problem of whether to normalize vertices. I guess OSM is saying "these are unique", and so we should maintain that, not do our own uniqueifying based on scary floating-point comparisons.
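A classic illustration of why those comparisons are scary:
0.1 + 0.2 == 0.3   # FALSE: the two doubles differ in their last bits
(0.1 + 0.2) - 0.3  # ~5.6e-17, so vertex de-duplication needs an explicit tolerance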
One complicating thought against what I just wrote, and in favour of the idea prior to that: it'll likely be necessary to allow ID values to be user-defined, because it'll otherwise be impossible to easily write round-trip tests like in geojsonsf. And that will require user ability to impose IDs at every level, including PATH construction, bringing us back to where we were. Having said that, I still intend to do osmdata_sc, because that will be heaps faster than osmdata_sf() %>% SC()
I'm pondering that. I just pushed the purely structural model "BINARY" (with only untested stubs for the worker verbs) - but there is a plot method for plot(BINARY(x)), with colouring at object level.
Obviously for sf and sp it's fine because they have no internal labels: we can subset the object table and there's no problem. If we subset the vertex table we have to re-index object$edge_, possibly dropping edges, and possibly dropping objects.
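A minimal sketch of that re-indexing, with made-up indices (not silicate code):
keep <- c(1L, 3L, 4L)                        # vertices retained from an original 1:5
edge <- cbind(c(1L, 3L, 5L), c(3L, 4L, 2L))  # edges as pairs of old vertex indices
lookup <- match(seq_len(5L), keep)           # old index -> new index (NA if dropped)
new_edge <- matrix(lookup[edge], ncol = 2L)
new_edge[complete.cases(new_edge), , drop = FALSE]  # drop edges that lost a vertex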
If you give it a labelled model, you'd want a more SC-like type - keeping all the vertex labels and path/way labels. So, do we change SC to use this binary form and keep labels if they are present (with an extra subclass?), or default to always having labels (creating them if needed), or just have different functions for label mode vs structure mode?
I've been grappling with this kind of idea from the beginning!
In the binary branch I've put updates to SC; critically, it no longer uses PATH but derives from BINARY.
BINARY is two tables, with object$edge_ a nested structural index into vertex.
SC is now three tables, object, edge, vertex. I think I'll drop the segment/edge distinction, and always keep edge with the object_ ID. When edges are really unordered we can make that explicit by remapping, nesting the object ID (if needed, and possibly with a direction tag).
So, BINARY is the primary workhorse for unlabelled models; paths are still implicit because of the edge indexes. SC is all edges, as a link between vertex and object - and it's arbitrary so we can subset wherever and joins handle the rest.
I'm still not sure how to go about this for labelled models (like OSM), but maybe a test upfront for "has_labels", checking whether the input model is internally labelled: if it isn't, BINARY is the starting point. Otherwise we make sure the verbs return IDs if the model has them, which means we need an "osm-doc" class or something like that, and write the sc_verbs against that. If sc_object/sc_vertex and so on have IDs then we avoid using BINARY at all.
If rgl had names, explaining all this would be easier.
I'm happy with this compromise now: SC0 is dense, compact. SC is "fully labelled". TRI and PATH and ARC remain fully labelled.
We could improve things with careful ID management using integers, but I'll leave that for future work.
(BINARY is removed.)
Everything is now in the master branch.
I am terrified of this, but I put an integer-ID system into the integers branch. Technically it can be set by global option to use the old big IDs, but that will require more care. Do you see any dangers? (Obviously we might have a table bigger than the integer limit, but that can probably be handled when it arises, and is better done by a DB anyway.)
One thing is keeping IDs that come in, but that's fine for osmdata_sc, and I'll use rownames if present for the sc_uid.data.frame method.
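A hypothetical sketch of the option switch mentioned above (the option and function names here are made up, not the actual integers-branch code):
sc_uid_default <- function(n) {
  if (isTRUE(getOption("silicate.uid.character", FALSE))) {
    ids::random_id(n, bytes = 5)  # old scheme: big random character IDs
  } else {
    seq_len(n)                    # new scheme: compact incremental integers
  }
}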
Another thought here: the ids::random_id() function pretends to use a "bytes" parameter, but actually just returns hex-encoded characters (two per byte), which can optionally be used for hex-level bit control. In most cases, including here, that's neither used nor relevant, and the effect is that each block of two characters contains just 16^2 = 256 possibilities, so the chance of repeating one of the currently-implemented bytes = 5 IDs is 16^-10.
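That's easy to check directly (requires the ids package):
x <- ids::random_id(5, bytes = 5)
nchar(x)                      # 10 characters each: two hex digits per byte
all(grepl("^[0-9a-f]+$", x))  # TRUE: a 16-symbol alphabet, not 62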
Current osmdata_sc code uses this function:
#include <Rcpp.h>     // Rcpp::runif
#include <algorithm>  // std::generate_n
#include <cmath>      // std::floor
#include <string>

// Function to generate IDs for the edges in each way
std::string random_id (size_t len) {
    auto randchar = []() -> char
    {
        const char charset[] =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
        const size_t max_index = (sizeof (charset) - 1);
        //return charset [ rand() % max_index ];
        size_t i = static_cast <size_t> (std::floor (Rcpp::runif (1) [0] * max_index));
        return charset [i];
    };
    std::string str (len, 0);
    std::generate_n (str.begin(), len, randchar);
    return str;
}
(currently residing here.) The chance of generating identical IDs in this case is 62^-10, which is about six orders of magnitude lower ((62/16)^10 is roughly 8 × 10^5). Preferable? Easy R implementation like this:
rhash <- function(n = 20) {
    ## fixed prefix plus n random alphanumeric characters
    paste (c ("jst", sample (c (LETTERS, letters, 0:9), n, TRUE)), collapse = "")
}
Cool thanks! I've been realizing that the way I'm doing it is not very sensible or efficient, and I think integers is fine - but the verbs should pick out existing IDs if they are there.
What's the "jst" for?
For vectorizing, I'd generate every value and split/collapse:
nhash <- 10  # number of IDs
n <- 20      # characters per ID
chars <- c(LETTERS, letters, 0:9)
unlist(lapply(split(sample(chars, nhash * n, replace = TRUE),
                    rep(1:nhash, each = n)),
              paste, collapse = ""))
But, I'm sure your C++ will burn that to the ground
What do you think of using negative as well as positive integers for the key? Is that crazy?
Oh, the prefix is arbitrary but allows them to be used as file names, coz some OSs don't allow numeric starts. Negative integers would also need a minus at the front, which is also likely not sensible for file names.
Why would these ever be file names? (But good principles at any rate)
Yeah, quite possibly if for example iterating over object table and saving/caching each one separately. Ya never know, would be my motto, so good to presume it might be desired
Some types come in with names of things at every level (OSM) and some things come in with names at no levels (sf), and others with names only at the first object-level (sp, and sf sometimes).
The generic sc_uid should handle these by either preserving existing IDs or creating new ones as needed. It would be nice to be able to override these and nominate UIDs from a column or by direct input.
The question of how to maintain a global pool of ever-unique UIDs remains.
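A hypothetical sketch of such a generic (the method dispatch and names here are illustrative, not the current silicate code):
sc_uid <- function(x, ...) UseMethod("sc_uid")
sc_uid.default <- function(x, ...) seq_len(nrow(x))  # made-up IDs: incremental integers
sc_uid.data.frame <- function(x, ...) {
  rn <- rownames(x)
  ## preserve meaningful row names as IDs; fall back to integers otherwise
  if (!is.null(rn) && !identical(rn, as.character(seq_len(nrow(x))))) rn
  else seq_len(nrow(x))
}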