leeper / csvy

Import and Export CSV Data With a YAML Metadata Header

Performance improvements #9

Open hadley opened 6 years ago

hadley commented 6 years ago

Currently read_csvy() reads the complete file using readLines(), which means it will be slow for large files. I'd recommend (and can possibly help with) writing a C/C++ read_yaml_header() function that would parse from the first --- to the next ---. This metadata could then be used to generate the column specification that's passed to read.csv(), read_csv(), or fread(). (Some additional cleanup will probably still be needed afterwards.)
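As a rough illustration of the "metadata to column specification" step, here is a minimal sketch in base R. It assumes the parsed header is a list with a `fields` entry whose elements carry `name` and `type`; the helper name `fields_to_colclasses()` and the type mapping are hypothetical, not part of csvy.

```r
# Hypothetical helper: turn parsed header metadata into a named colClasses
# vector for read.csv(). The fields/name/type layout and the type names in
# type_map are assumptions made for this sketch.
fields_to_colclasses <- function(metadata) {
  type_map <- c(string = "character", integer = "integer",
                number = "numeric", boolean = "logical")
  types <- vapply(
    metadata$fields,
    function(f) if (is.null(f$type)) NA_character_ else f$type,
    character(1)
  )
  # Unknown or missing types become NA, which lets read.csv() guess.
  classes <- unname(type_map[types])
  names(classes) <- vapply(metadata$fields, function(f) f$name, character(1))
  classes
}

# Made-up metadata standing in for a parsed YAML header.
metadata <- list(fields = list(
  list(name = "Sepal.Length", type = "number"),
  list(name = "Species", type = "string")
))
col_classes <- fields_to_colclasses(metadata)

# Apply the specification to a small in-memory CSV body.
csv_text <- "Sepal.Length,Species\n5.1,setosa\n4.9,setosa"
dat <- read.csv(text = csv_text, colClasses = col_classes)
```

Leaving unknown types as NA keeps the reader's own type guessing as a fallback, which fits the "additional cleanup afterwards" caveat.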

leeper commented 6 years ago

That would be awesome.

ashiklom commented 6 years ago

Not in C, but a first pass at this might look something like this. It relies on the fact that, once a connection is opened with con <- file("/path/to/file", "r"), each call to readLines(con, n = 1) reads a single line and automatically advances to the next one.

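# Read only the YAML header of a CSVY file, one line at a time, so the data
# section is never loaded. Returns the lines between the two "---" delimiters
# (delimiters excluded), or NULL if the file does not start with one.
get_yaml_header <- function(filename, yaml_rxp = "^#?---[[:space:]]*$") {
  con <- file(filename, "r")
  on.exit(close(con))
  first_line <- readLines(con, n = 1)
  # An empty file yields character(0), which would break the if() below.
  if (length(first_line) == 0 || !grepl(yaml_rxp, first_line)) {
    warning("No YAML header found.")
    return(NULL)
  }
  header <- character()
  repeat {
    curr_line <- readLines(con, n = 1)
    # Stop cleanly at end of file instead of failing on a zero-length line.
    if (length(curr_line) == 0) {
      stop("Reached end of file before the closing YAML delimiter.")
    }
    if (grepl(yaml_rxp, curr_line)) break
    header <- c(header, curr_line)
  }
  header
}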

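# Parse the raw header lines into a list. If every line is commented out with
# a leading "#", strip that prefix before handing the text to the YAML parser.
parse_yaml_header <- function(yaml_header) {
  if (all(grepl("^#", yaml_header))) {
    yaml_header <- sub("^#", "", yaml_header)
  }
  yaml::yaml.load(paste(yaml_header, collapse = "\n"))
}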

raw_header <- get_yaml_header("iris.csvy")
metadata <- parse_yaml_header(raw_header)

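You should then be able to do something like csv_file <- fread(filename, skip = length(raw_header) + 2, ...), assuming raw_header holds only the lines between the two --- delimiters, so that the + 2 accounts for the delimiters themselves.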

If this looks OK, I can try to put together a more complete pull request later this week.
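The skip-based approach described above can be checked end to end with a small sketch. It writes a made-up CSVY file to a temporary path, counts the header lines by reading one line at a time from an open connection, and then reads only the data section. The helper name `count_header_lines()` and the file contents are hypothetical, and base read.csv() stands in for fread(), which also accepts a skip argument.

```r
# Write a tiny CSVY file to a temporary location (contents are made up).
tmp <- tempfile(fileext = ".csvy")
writeLines(c(
  "---",
  "name: example",
  "fields:",
  "  - name: x",
  "  - name: y",
  "---",
  "x,y",
  "1,a",
  "2,b"
), tmp)

# Hypothetical helper: number of lines occupied by the YAML header,
# including both delimiters, or 0 if the file has no header.
count_header_lines <- function(filename, yaml_rxp = "^#?---[[:space:]]*$") {
  con <- file(filename, "r")
  on.exit(close(con))
  first_line <- readLines(con, n = 1)
  if (length(first_line) == 0 || !grepl(yaml_rxp, first_line)) return(0L)
  n <- 1L
  repeat {
    # Each call consumes exactly one line and advances the connection.
    line <- readLines(con, n = 1)
    if (length(line) == 0) stop("No closing YAML delimiter found.")
    n <- n + 1L
    if (grepl(yaml_rxp, line)) return(n)
  }
}

n_skip <- count_header_lines(tmp)

# Only the CSV body is parsed; the header lines are skipped.
dat <- read.csv(tmp, skip = n_skip)
unlink(tmp)
```

Because the connection is closed as soon as the closing delimiter is found, the cost of locating the header depends on the header length rather than on the size of the data section.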

leeper commented 6 years ago

That would be awesome!

leeper commented 6 years ago

Merging of #15 is done. We could do further C-level fixes, but this seems good for the time being.