Open hadley opened 6 years ago
That would be awesome.
Not in C, but a first pass at this might look something like this. It uses the fact that if con <- file("/path/to/file", "r"), then readLines(con, n = 1) reads the file one line at a time, automatically advancing to the next line on each call.
get_yaml_header <- function(filename, yaml_rxp = "^#?---[[:space:]]*$") {
  con <- file(filename, "r")
  on.exit(close(con))
  # The header must start on the very first line of the file.
  first_line <- readLines(con, n = 1)
  if (length(first_line) == 0 || !grepl(yaml_rxp, first_line)) {
    warning("No YAML header found.")
    return(NULL)
  }
  tag_vec <- character()
  repeat {
    # Each call reads the next line; character(0) signals end of file.
    curr_line <- readLines(con, n = 1)
    if (length(curr_line) == 0) {
      warning("No closing YAML delimiter found.")
      return(NULL)
    }
    if (grepl(yaml_rxp, curr_line)) break
    tag_vec <- c(tag_vec, curr_line)
  }
  # Header lines only, without the opening and closing delimiters.
  tag_vec
}
parse_yaml_header <- function(yaml_header) {
  # If every header line is commented out with a leading "#", strip it.
  if (all(grepl("^#", yaml_header))) {
    yaml_header <- gsub("^#", "", yaml_header)
  }
  yaml::yaml.load(paste(yaml_header, collapse = "\n"))
}
raw_header <- get_yaml_header("iris.csvy")
metadata <- parse_yaml_header(raw_header)
You should then be able to do something like csv_file <- fread(filename, skip = length(raw_header) + 2, ...).
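As a rough illustration of the skip idea, the sketch below writes a small CSVY file to a temporary path, counts the header lines, and reads the remaining rows. It uses utils::read.csv() rather than fread() so that it runs with base R only; the file contents and variable names are made up for the example.

```r
# Hypothetical CSVY content: a YAML header between two --- lines, then CSV.
csvy_lines <- c(
  "---",
  "name: example",
  "fields:",
  "  - name: x",
  "  - name: y",
  "---",
  "x,y",
  "1,a",
  "2,b"
)
path <- tempfile(fileext = ".csvy")
writeLines(csvy_lines, path)

# Locate the two delimiter lines. This reads the whole file, which is fine
# for a small example but is what a connection-based reader would avoid.
all_lines <- readLines(path)
delims <- grep("^#?---[[:space:]]*$", all_lines)
n_header <- delims[2] - delims[1] - 1
header <- all_lines[delims[1] + seq_len(n_header)]

# Skip the header lines plus the opening and closing delimiters.
csv_data <- read.csv(path, skip = length(header) + 2, stringsAsFactors = FALSE)
print(csv_data)
```

The same skip value should carry over to fread(), since it also accepts a skip argument giving the number of lines to pass over before parsing.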
If this looks OK, I can try to put together a more complete pull request later this week.
That would be awesome!
Merging of #15 is done. We could do further C-level fixes, but this seems good for the time being.
Currently read_csvy reads the complete file using readLines(), which means it will be slow for large files. I'd recommend (and can possibly help with) writing a C/C++ read_yaml_header() function that would parse from the first --- to the next ---. This metadata could then be used to generate the column specification that's passed to read.csv(), read_csv(), and fread(). (Will probably still need some additional cleanup afterwards.)
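To make the column-specification idea concrete, here is a minimal sketch that turns parsed header metadata into a colClasses vector for read.csv(). It assumes the metadata is a list with a fields element whose entries carry name and type. The type-to-class mapping and the function name col_classes_from_metadata are hypothetical, not part of any package.

```r
# Hypothetical helper: map header field types to R column classes.
col_classes_from_metadata <- function(metadata) {
  # Assumed mapping from header type names to R classes.
  type_map <- c(string = "character", integer = "integer",
                number = "numeric", boolean = "logical")
  field_names <- vapply(metadata$fields, function(f) f$name, character(1))
  field_types <- vapply(metadata$fields, function(f) f$type, character(1))
  classes <- unname(type_map[field_types])
  # Unknown types become NA, which lets read.csv() guess that column.
  stats::setNames(classes, field_names)
}

# Stand-in for what a YAML parser might return for the header.
metadata <- list(fields = list(
  list(name = "id", type = "integer"),
  list(name = "label", type = "string"),
  list(name = "score", type = "number")
))
col_classes <- col_classes_from_metadata(metadata)

# Read a small inline CSV using the derived column classes.
csv_text <- "id,label,score\n1,a,0.5\n2,b,1.5"
csv_data <- read.csv(text = csv_text, colClasses = col_classes)
str(csv_data)
```

read_csv() and fread() take their column types through different arguments, so a real implementation would probably need one small adapter per reader.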