Closed barracuda156 closed 1 week ago
P. S. That also needs a soft dependency on ggdag
, since that pulls in dagitty
again.
Hi @barracuda156 apologies for the late reply here. Making dependencies + install more lightweight is a very valid suggestion. We're in the middle of a dev sprint for an updated version of CausalQueries at the moment and will take this issue up as part of that. Will keep you updated on this thread.
@till-tietz Sounds good, thank you.
@macartan @lilymedina @gerasy1987 --> not sure if this is possible under the license of dagitty but since we only use very limited functionality from the package we could pull in only those functions to make install more user friendly
@till-tietz seems that the dagitty
R package is under GPL-2 license, which, as I understand, means that if we stay open source (which I assume we are), we can use their code.
That said, I believe we are using dagitty::dagitty,
which is the main function in the package, and also directly uses some of the V8
functionality, so I am not sure we can get around requiring dagitty
.
@gerasy1987 fair point. Just writing up a basic fully R based version of the DAG statement parsing that could perhaps do the trick here. Will branch from this issue and push there
@gerasy1987 something like this should basically replicate the functionality of dagitty::dagitty() |> dagitty::edges() |> data.frame()
which we use to construct other object in make_model()
and require in slightly modified form for plotting.
make_dag <- function(model_string) {
# split by ; first to separate discrete parts of the DAG statement
model_string <- strsplit(model_string, ";")[[1]]
lapply(model_string, function(model_substring) {
nodes <- trimws(strsplit(model_substring, "(<->|->|<-)")[[1]])
edges <- trimws(strsplit(model_substring, "(\\w+)")[[1]])[-1]
dag <- vector(mode = "list", length = length(nodes) - 1)
for(i in 1:(length(nodes) - 1)) {
if(edges[i] == "<-") {
dag[[i]] <- data.frame(
v = nodes[i+1],
w = nodes[i],
e = "->"
)
} else {
dag[[i]] <- data.frame(
v = nodes[i],
w = nodes[i+1],
e = edges[i]
)
}
}
return(dag)
}) |>
dplyr::bind_rows() |>
dplyr::arrange(v)
}
@macartan for your visibility
I tried it out on a bunch of statements and seems to work well though see below
some funny stuff if people make mistakes:
Could be more information:
make_dag("a c") [image: image.png] Show Traceback [image: image.png] Rerun with Debug Error in if (edges[i] == "<-") { : argument is of length zero
Odder behavior here as <> is picked up as the edge and no error:
make_dag("a <> b -> c") v w e 1 a <> b c <>
make_dag("a <- a ->") v w e 1 a a ->
duplication not spotted: need to use "distinct()" ?
make_dag("a -> b; a -> b") v w e 1 a b -> 2 a b ->
we have a bunch of functions we use to get to the plotting data frame, including from dagitty; wonder if these can be reduced also:
dagitty_statement <- paste("dag{", statement, "}") |> dagitty::dagitty()
if (!is.null(x_coord) & !is.null(y_coord)) { names(x_coord) <- names(y_coord) <- nodes
dagitty::coordinates(dagitty_statement) <-
list(x = x_coord , y = y_coord) |>
coords2df() |>
coords2list()
}
On Tue, Aug 6, 2024 at 1:49 PM Till Tietz @.***> wrote:
@macartan https://github.com/macartan for your visibility
— Reply to this email directly, view it on GitHub https://github.com/integrated-inferences/CausalQueries/issues/340#issuecomment-2271100917, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADBE57LIHIK5EU6QLEZCISLZQCZ6DAVCNFSM6AAAAABJMSZGKOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZRGEYDAOJRG4 . You are receiving this because you were mentioned.Message ID: @.***>
@macartan thanks for testing. Will add some logic to guard against malformed statements. I think we should be able to reduce reliance on dagitty for plotting too
@macartan this helper should deal with a lot of the edge cases arising from malformed statements. In make_dag
we'll split the statement by ';' then apply consolidate_statement
to each sub-statement. We could even just run consolidate_statement
as a first step in make_model
. Will put everything together and push to this issue's branch for review.
consolidate_statement <- function(statement) {
## consolidate edges
statement <- gsub("\\s+", "", statement)
statement <- strsplit(statement, "")[[1]]
i <- 1
st_edge <- c()
while (i <= length(statement)) {
# Check for pattern "<", "-", ">"
if (i <= length(statement) - 2 &&
statement[i] == "<" &&
statement[i + 1] == "-" && statement[i + 2] == ">") {
st_edge <- c(st_edge, "<->")
i <- i + 3 # Skip the next two elements
}
# Check for pattern "<", "-"
else if (i <= length(statement) - 1 &&
statement[i] == "<" && statement[i + 1] == "-") {
st_edge <- c(st_edge, "<-")
i <- i + 2 # Skip the next element
}
# Check for pattern "-", ">"
else if (i <= length(statement) - 1 &&
statement[i] == "-" && statement[i + 1] == ">") {
st_edge <- c(st_edge, "->")
i <- i + 2 # Skip the next element
}
# Otherwise, just append the current element
else {
st_edge <- c(st_edge, statement[i])
i <- i + 1
}
}
# detect edges
is_edge <- st_edge %in% c("->", "<-", "<->")
# check for bare edges (i.e. edges without origin or destination nodes)
# either dangling edge at beginning / end of statement
has_dangling_edge <- any(c(is_edge[1], is_edge[length(is_edge)]))
# or consecutive edges within statement
consecutive_edge <- rle(is_edge)
has_consecutive_edge <-any(consecutive_edge$values & consecutive_edge$lengths >= 2)
if(has_dangling_edge || has_consecutive_edge) {
stop("Statement contains bare edges without a source or destination node or both. Edges should connect nodes.")
}
## consolidate nodes
# check for unsupported characters in varnames
if(any(c("<",">") %in% st_edge[!is_edge])) {
stop("Unsupported characters in varnames. No '<' or '>' in varnames please.")
}
if("-" %in% st_edge[!is_edge]) {
stop("Unsupported characters in varnames. No hyphens '-' in varnames please; try dots?")
}
if("_" %in% st_edge[!is_edge]) {
stop("Unsupported characters in varnames. No underscores '_' in varnames please; try dots?")
}
# counter that increments every time we hit an edge --> each character
# belonging to a node thus has the same number (node_id)
node_id <- cumsum(is_edge)
# split by node_id and paste all characters with same node_id together
nodes <- split(st_edge[!is_edge], node_id[!is_edge]) |>
vapply(paste, collapse = "", "") |>
unname()
return(list(nodes = nodes, edges = st_edge[is_edge]))
}
@macartan @gerasy1987 finally managed to replace ggdag + igraph functionality with some lightweight customer helpers. changes are in the PR. Let me know if you are happy with how plotting output.
@till-tietz Thank you for working on this!
Just for the record, there are some remnants of dagitty
in the code:
Invalid package aliases in Rd file 'CausalQueries-package.Rd':
‘-package’
* checking Rd cross-references ... WARNING
Missing link or links in Rd file 'clean_statement.Rd':
‘dagitty’
Missing link or links in Rd file 'make_dag.Rd':
‘dagitty’
Missing link or links in Rd file 'make_model.Rd':
‘dagitty’
Missing link or links in Rd file 'print.causal_model.Rd':
‘dagitty’
Missing link or links in Rd file 'summary.causal_model.Rd':
‘details’
See section 'Cross-references' in the 'Writing R Extensions' manual.
I am building from 9d6da894a89630130b5ddada936c1cc8b196a538 now.
Hi @barracuda156 ... will check + clean the roxygen snytax and rebuild with devtools to purge these
dagitty
depends on V8, which is a very heavy dependency and also broken on a number of less-mainstream platforms. Is it possible to make it a suggested, but not required dependency?