nuest closed this issue 6 years ago
I saw this already; it can be added as new regex(es) at
"Take only those that exist" is implemented:
An elegant solution would use the minimal number of regexes necessary to cover the maximum number of imaginable ways to read an input file in R.
I also thought about a more greedy implementation that matches all values in the target R code and checks which of them are also present in the target workspace files.
The "very" greedy variant would not distinguish between input and output.
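For illustration, the greedy variant described above could look roughly like this in Python (a hypothetical sketch; the function name, the candidate regex, and the workspace handling are mine, not o2r-meta code):

```python
import re
from pathlib import Path

# Hypothetical sketch of the "greedy" variant: collect every quoted string
# in the R code that looks like a file name, then keep only those that
# actually exist in the workspace directory.
FILENAME_CANDIDATE = re.compile(r'["\']([^"\']+\.[A-Za-z0-9]{1,4})["\']')

def suggest_used_files(r_code, workspace):
    candidates = set(FILENAME_CANDIDATE.findall(r_code))
    present = {p.name for p in Path(workspace).iterdir() if p.is_file()}
    # No input/output distinction: anything referenced AND present is kept.
    return sorted(candidates & present)
```

Note that a file written by the script would only be suggested if it already exists in the workspace, which is exactly the input/output ambiguity of the "very" greedy variant.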
Correct, but we cannot always know that for sure anyway:
imagine something like
data = "dataset.csv"
load(data)
data = abc(data)
save(data)
Not unless the extraktor interprets what is actually happening in the code. It was designed to make suggestions for the obvious information; there are no heuristics, hence the restriction to regexes.
I can live with the simplification, but then the extraction output should also not distinguish `inputfiles` from `outputfiles` but just say ... `used_files`?
Btw, the example above is not a reproducible analysis.
No no, that's not what I meant; the distinction between inputfiles and outputfiles appears very useful to me. The point is, the user has to name the inputfiles. The extraktor can facilitate this process by making suggestions based on the files in the workspace that have been seen in the code. That is by all means no vote for the greedy way to gather filenames.
> Btw, the example above is not a reproducible analysis.
Yes, the example does not analyse anything, ergo there is no reproducible analysis. I did not even check if this was valid R, in keeping with the purpose of my illustration ;-)
Okay, I devised a regex that can deal with most cases:
read[r\.\:\_].*file\=[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}
example:
a1 <- read.csv(header=TRUE, file="c:/myData.csv", sep=",")
a2 <- read.csv2(file="f:\file.xy", header=TRUE, sep = ";", quote = "\"",
dec = ",", fill = T)
a3 <- read.delim(sep="$", header=FALSE, file="D:/sup/SPOL/values/foo.txt")
a4 <- read.delim2(file="stuff.tsv", header = T, sep = "\t", quote = "\"",
dec = ",", fill = TRUE, comment.char = "")
a5 <- readr::read_csv(file="mt,cars.csv.zip")
a6 <- read.table(file="abcde.csv", header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
yields:
c:/myData.csv
f:\file.xy
D:/sup/SPOL/values/foo.txt
stuff.tsv
mt,cars.csv.zip
abcde.csv
Try it out on https://pythex.org (a deep link did not work for this).
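The same matches can be reproduced locally with Python's `re` module (which is what pythex uses), shown here for two of the example lines:

```python
import re

# The regex from the comment above, verbatim; compiled with Python's re
# module, it yields the captured file names listed under "yields".
PATTERN = re.compile(
    r'read[r\.\:\_].*file\=[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}'
)

examples = [
    'a1 <- read.csv(header=TRUE, file="c:/myData.csv", sep=",")',
    'a5 <- readr::read_csv(file="mt,cars.csv.zip")',
]
for line in examples:
    print(PATTERN.search(line).group(1))
# c:/myData.csv
# mt,cars.csv.zip
```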
Restrictions:
- The matching relies on the `.` symbol; dotall is the default for my parser, hence changing this would require a redesign of the matching function.
- It requires `file=` plus quotes. Since `file` has no fixed position in the function arguments, it is hard to identify the filename when no argument keyword is used. The quotes I need because there could be non-alphanumeric chars in the pathname (filename), e.g. `,`, `drømmehus.txt` or `данные.csv`.
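A quick check (my own illustration) makes the charset restriction concrete: the character class in the capture group is ASCII-only, so for a non-ASCII file name the capture comes back empty even though the surrounding pattern still matches:

```python
import re

# The regex from above; the capture group's character class contains only
# ASCII letters, digits and a few punctuation chars, so non-ASCII file
# names fall out of the captured group.
PATTERN = re.compile(
    r'read[r\.\:\_].*file\=[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}'
)

ok = PATTERN.search('x <- read.csv(file="data.csv")')
bad = PATTERN.search('x <- read.csv(file="данные.csv")')
print(repr(ok.group(1)))   # 'data.csv'
print(repr(bad.group(1)))  # '' (the file name is lost)
```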
The regex can also not handle the common (and recommended) style of putting spaces around the `=`:
data <- read.csv(file = "data.csv")
I noticed the dotall restriction only accounts for matching headers and code blocks in Rmd and YAML, so the regex is indeed patchable. I modified it in the last commit to also match a spaced `=`. Not perfect, but it enables us to find a whole new bunch of inputfiles.
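A minimal sketch of such a patch (my own guess at the shape, not the actual committed regex) just allows optional whitespace around the `=` via `\s*`:

```python
import re

# The earlier regex with \s* inserted around "=", so that both
# file="data.csv" and file = "data.csv" are matched.
PATTERN = re.compile(
    r'read[r\.\:\_].*file\s*\=\s*[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}'
)

m = PATTERN.search('data <- read.csv(file = "data.csv")')
print(m.group(1))  # data.csv
```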
I also added a new regex for files that are written in the R code, as requested by @MarkusKonk. Since these are not inputfiles in a narrow sense, we might want to consider a new category ("output"?) later.
Noted ideas about outputfiles at https://github.com/o2r-project/o2r-meta/issues/99
Right now `extract` only notices `read.csv("name of file here")`. So it did not suggest an inputfile in an R Markdown document using a named `file` argument. So `extract` does not work with named function arguments AFAICS. We cannot cover all cases here (like when the file name is generated by a function, or when a read statement is formatted across multiple lines), but maybe we can cover a bit more nevertheless. I suggest the following approach, as I hope it might reduce the need to cover both named and unnamed parameters manually (which of course is also an alternative):
Extend the `r_input` regex(es) to also cover:
- read.csv2
- read.delim
- read.delim2
- readr::read_*
- read.table

and the connection functions listed in `R -e "?open"`:
- file
- url
- gzfile
- bzfile
- xzfile
- unz
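As an illustration of this suggestion (my own sketch, not the actual o2r-meta implementation), a single alternation could cover those reader and connection functions and tolerate both named and unnamed first arguments:

```python
import re

# Hypothetical combined pattern: one alternation for the reader functions
# plus the connection functions from ?open, capturing the first quoted
# argument whether or not it is named (e.g. file = "...").
READERS = (r'(?:read\.csv2?|read\.delim2?|read\.table|readr::read_\w+'
           r'|file|url|gzfile|bzfile|xzfile|unz)')
PATTERN = re.compile(READERS + r'\s*\(\s*(?:\w+\s*=\s*)?["\']([^"\']+)["\']')

samples = [
    'read.delim2(file = "stuff.tsv")',
    'read.table("abcde.csv", header = FALSE)',
    'gzfile("archive.csv.gz")',
]
print([PATTERN.search(s).group(1) for s in samples])
# ['stuff.tsv', 'abcde.csv', 'archive.csv.gz']
```

It would still miss computed file names and multi-line calls, as noted above, but it avoids maintaining separate named- and unnamed-argument regexes per function.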