Extend detection of loaded CSV files in R

nuest commented 6 years ago

Right now extract only notices read.csv("name of file here"). So it did not suggest an input file in an R Markdown document using

data <- read.csv(file = "data.csv")

So extract does not work with named function arguments AFAICS.

We cannot cover all cases here (like, when the file name is generated by a function, or when a read statement is formatted across multiple lines), but maybe we can cover a bit more nevertheless. I suggest the following approach, as I hope it might reduce the need to cover both named and unnamed parameters manually (which of course is also an alternative):

- add some more matches for function names for loading data (to be listed in r_input)
  - read.csv2
  - read.delim
  - read.delim2
  - readr::read_*
  - read.table
  - based on R -e "?open"
    - file
    - url
    - gzfile
    - bzfile
    - xzfile
    - unz
    - open
- extract all potential file names from the call, i.e. look for everything that is within double quotes (")
- remove all file names from the list where no file of that name exists
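The three steps above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the names READ_FUNCTIONS, candidate_files and existing_files are hypothetical and not part of o2r-meta, and the call pattern only handles calls without nested parentheses.

```python
import os
import re

# Function names taken from the list above; the pattern itself is an assumption.
READ_FUNCTIONS = (
    r"(?:readr::read_\w+|read\.csv2?|read\.delim2?|read\.table"
    r"|gzfile|bzfile|xzfile|file|url|unz|open)"
)
# A call is a known function name followed by (...) without nested parentheses.
CALL = re.compile(r"(?<![\w.])" + READ_FUNCTIONS + r"\s*\(([^()]*)\)")
# Everything within double quotes counts as a potential file name.
QUOTED = re.compile(r"\"([^\"]*)\"")


def candidate_files(r_code):
    """Collect every double-quoted string inside a call to a known read function."""
    names = []
    for call in CALL.finditer(r_code):
        names.extend(QUOTED.findall(call.group(1)))
    return names


def existing_files(names, workdir="."):
    """Keep only those candidates for which a file of that name exists."""
    return [n for n in names if os.path.isfile(os.path.join(workdir, n))]


print(candidate_files('data <- read.csv(file = "data.csv")'))  # ['data.csv']
```

Values such as sep = "," also end up in the candidate list, which is why the existence check in the last step matters. Named and unnamed arguments are treated alike.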

ghost commented 6 years ago

I saw this already, it can be added as new regex(es) at

https://github.com/o2r-project/o2r-meta/blob/dev/parsers/parse_rmd.py#L42

Taking only those that exist is already implemented:

https://github.com/o2r-project/o2r-meta/blob/dev/extract/metaextract.py#L305

An elegant solution would use the minimal number of regexes necessary to cover the maximum of imaginable ways to read an input file in R.

I also thought about a more greedy implementation that matches all values in the target R code and checks which of them are also present in the target workspace files.
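A minimal sketch of what such a greedy variant might look like, assuming the workspace is a plain directory tree. The names workspace_files and greedy_used_files are made up for illustration and do not exist in o2r-meta.

```python
import os
import re

# Any single- or double-quoted string literal in the code.
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")


def workspace_files(workdir):
    """Return all files below workdir as relative paths with forward slashes."""
    found = set()
    for root, _dirs, files in os.walk(workdir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), workdir)
            found.add(rel.replace(os.sep, "/"))
    return found


def greedy_used_files(r_code, workdir):
    """Return every string literal in the code that names a workspace file."""
    literals = {m.group(2) for m in STRING_LITERAL.finditer(r_code)}
    return sorted(literals & workspace_files(workdir))
```

Because it looks at every string literal regardless of the surrounding function, this sketch reports files that are written as well as files that are read.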

nuest commented 6 years ago

The "very" greedy variant would not distinguish between input and output.

ghost commented 6 years ago

Correct, but we cannot always know that for sure anyway:

imagine something like

data = "dataset.csv" 
load(data)
data = abc(data)
save(data)

Not unless the extractor interprets what is actually happening in the code. It was designed to make suggestions for the obvious information; there are no heuristics, hence the restriction to regex.

nuest commented 6 years ago

I can live with the simplification, but then the extraction output should also not distinguish inputfiles from outputfile but just say ... used_files ?

nuest commented 6 years ago

Btw, the example above - not a reproducible analysis.

ghost commented 6 years ago

no no, that's not what I meant; the distinction between inputfiles and outputfiles appears very useful to me. The point is, the user has to name the inputfiles. The extractor can facilitate this process by making suggestions based on the files in the workspace that have been seen in the code. That is by no means a vote for the greedy way to gather file names.

> Btw, the example above - not a reproducible analysis.

yes, the example does not analyse, ergo there is no reproducible analysis. I did not even check if this was R, following the purpose of my illustration ;-)

ghost commented 6 years ago

Okay I devised a regex that could deal with most cases:

read[r\.\:\_].*file\=[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}

example:

a1 <- read.csv(header=TRUE, file="c:/myData.csv", sep=",")
a2 <- read.csv2(file="f:\file.xy", header=TRUE, sep = ";", quote = "\"",
          dec = ",", fill = T)
a3 <- read.delim(sep="$", header=FALSE, file="D:/sup/SPOL/values/foo.txt")
a4 <- read.delim2(file="stuff.tsv", header = T, sep = "\t", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "")
a5 <- readr::read_csv(file="mt,cars.csv.zip")
a6 <- read.table(file="abcde.csv", header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

yields:

c:/myData.csv
f:\file.xy
D:/sup/SPOL/values/foo.txt
stuff.tsv
mt,cars.csv.zip
abcde.csv

Try out on https://pythex.org (deep link did not work for this).
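For anyone who prefers to check this locally instead of on pythex, here is a small sketch using Python's re module on a few of the example lines above. The variable names are arbitrary; the pattern is the one posted above, applied without the dotall flag.

```python
import re

# The regex from above, unchanged.
PATTERN = r"read[r\.\:\_].*file\=[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}"

# A subset of the example R code, kept verbatim in a raw string.
R_CODE = r'''
a1 <- read.csv(header=TRUE, file="c:/myData.csv", sep=",")
a2 <- read.csv2(file="f:\file.xy", header=TRUE, sep = ";", quote = "\"",
          dec = ",", fill = T)
a5 <- readr::read_csv(file="mt,cars.csv.zip")
'''

# Without re.DOTALL the dot stops at newlines, so each match stays on one line.
matches = re.findall(PATTERN, R_CODE)
for name in matches:
    print(name)
```

This prints c:/myData.csv, f:\file.xy and mt,cars.csv.zip, one per line.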

Restrictions:

- doesn't work with the dotall flag (i.e. when the newline character is included in the catch-all . symbol). dotall is the default for my parser, so this would require redesigning the matching function.
- can't find matches where the R function is called without file= plus quotes. Since file has no fixed position in the function arguments, it's hard to identify when no argument keyword is used. I need the quotes because there could be non-alphanumeric characters in the path name (file name), e.g. a comma.
- can't find file names beyond standard ASCII alphanumerics, e.g. not drømmehus.txt or данные.csv
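These restrictions can be reproduced with a few lines of Python. The snippet is a sketch for illustration only; the input strings are made up, and the pattern is the one posted above.

```python
import re

PATTERN = r"read[r\.\:\_].*file\=[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}"

two_calls = 'a <- read.csv(file="a.csv")\nb <- read.csv(file="b.csv")'

# Default flags: the dot stops at the newline, so both calls are found.
per_line = re.findall(PATTERN, two_calls)

# With DOTALL the greedy .* runs across lines and only the last file= survives.
with_dotall = re.findall(PATTERN, two_calls, flags=re.DOTALL)

# No file= keyword: nothing is matched at all.
positional = re.findall(PATTERN, 'x <- read.csv("data.csv")')

# Non-ASCII file name: the name is not captured.
non_ascii = re.findall(PATTERN, 'x <- read.csv(file="данные.csv")')

print(per_line)     # ['a.csv', 'b.csv']
print(with_dotall)  # ['b.csv']
print(positional)   # []
```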

nuest commented 6 years ago

The regex also cannot handle the common (and recommended) style of putting spaces around the =:

data <- read.csv(file = "data.csv")

ghost commented 6 years ago

I noticed the dotall restriction only applies to matching headers and code blocks in rmd and yaml, so the regex is indeed patchable. I modified it in the last commit to also match = surrounded by spaces. Not perfect, but it enables us to find a whole new bunch of inputfiles.
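The committed pattern is not quoted in this thread. As an assumption about what such a patch could look like, optional whitespace can be allowed on both sides of the equals sign (SPACED below is a hypothetical variant, not necessarily the one in the commit):

```python
import re

# The regex as originally posted.
ORIGINAL = r"read[r\.\:\_].*file\=[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}"
# Hypothetical variant: \s* tolerates spaces before and after the equals sign.
SPACED = r"read[r\.\:\_].*file\s*\=\s*[\"\']{0,1}([0-9A-Za-z\,\.\:\/\\]*)[\"\']{1,1}"

line = 'data <- read.csv(file = "data.csv")'

print(re.findall(ORIGINAL, line))  # []
print(re.findall(SPACED, line))    # ['data.csv']
```

The variant still matches the unspaced form file="data.csv", since \s* also accepts zero whitespace characters.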

I also added a new regex for files that are written in the R code, as requested by @MarkusKonk. Since these are not inputfiles in a narrow sense, we might want to consider a new category ("output"?) later.

nuest commented 6 years ago

Noted ideas about outputfiles at https://github.com/o2r-project/o2r-meta/issues/99

o2r-project / o2r-meta

Extend detection of loaded CSV files in R #81