Apache Arrow? - Githubissues

kbvernon commented 1 year ago

Hi, didn't want to barge in on #12 , so thought I would raise this as a separate issue. Have you considered using an Arrow table to hold your graph object rather than a base R data.frame? In addition to the better memory management, there are other advantages to going the Arrow route, like storing metadata. If this is something you think is worthwhile, I would be happy to help implement at least the R side of it - afraid the C++ side of things is a little bit beyond me at the moment.

I'm coming at this from ecology, where the conversion from a grid to a graph can be prohibitively costly in terms of memory.

Also, a somewhat related question: why did you choose to represent the numeric vertex IDs as character strings? Is there some advantage to this in your C++ code?

This is a great package, btw. Thanks!

vlarmet commented 1 year ago

Hi,

No, I have not considered using Arrow. To be honest, I don't know exactly what it is but I will look at it. Internally, vertex IDs are integers from 0 to nbnode-1. Character type is more flexible for the user because anything can be converted into character.

kbvernon commented 1 year ago

Hi, thanks!

I'm not sure I fully understand the Arrow system either, but playing around with it, I can say it is VERY fast, like data.table fast, and it makes a table VERY small in memory, which is really useful when you're working with a graph that has N > 1 M vertices and 8 * N edges.

library(arrow)
library(cppRouting)
library(lobstr)

roads_dataframe <- read.csv(
  "https://raw.githubusercontent.com/vlarmet/cppRouting/master/data_readme/roads.csv",
  colClasses = c("character", "character", "numeric")
)

roads_arrow <- arrow::arrow_table(roads_dataframe)

lobstr::obj_size(roads_dataframe)
# 29.59 MB

lobstr::obj_size(roads_arrow)
# 284.64 kB

vlarmet / cppRouting

Apache Arrow? #13