Closed morungos closed 8 months ago
@RobFryer Can you review this and see what you think?
The bit that fills me with dread is the code in `get_RECO` that implies ICES (or something) is generating invalid CSV. But I cannot trace any calls to this from anywhere, so maybe that is not our problem anyway.
Some notes:
- `quote = ""`, so that we don't strip quotes. That's a sensible default for tab-delimited data.
- `fread` might be less useful than it seemed, as we need tibbles. One thing I don't want to lose is the facility to read in each column with a specified class (e.g. character, logical, numeric, integer). I presume `fread` does this.
- If `fread` produces a tibble, then we can just turn that into a data.frame immediately afterwards, so that shouldn't be an issue.
I agree. The main reason for the changes I've made is that `data.table` and `fread` were proving much harder to work with, as row names are gone, and a lot of logic depends on them. As it is, tibbles "just work". I have code that works fine now, but the quoting is more important than I thought. Any time I've used tab files, there's been no quoting. Here, in some of the files, there is, but it is not consistent. So we do need to support it.
Anyway, I have managed to switch all file reading to use `read.table` with various parameters, so hopefully today I'll get something that does the encoding checks too, and get started on additional validations.
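As a rough sketch of what that switch looks like — the wrapper name, defaults, and `colClasses` example below are my own illustrative assumptions, not HARSAT code:

```r
# Hypothetical wrapper sketching the read.table-based approach described
# above. Keeps quote handling (some files quote inconsistently), a
# per-column class specification, and an explicit file encoding.
read_tab_file <- function(path, col_classes = NA, encoding = "UTF-8") {
  read.table(
    path,
    header = TRUE,
    sep = "\t",
    quote = "\"",              # support quoting, since it appears in some files
    colClasses = col_classes,  # e.g. c(station = "character", value = "numeric")
    fileEncoding = encoding,
    stringsAsFactors = FALSE   # plain character columns, as a tibble would give
  )
}
```

Unlike `fread`, this returns a plain `data.frame` with row names intact, which is what the existing logic depends on.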
Also, I would not exclude the possibility of allowing Excel. `readxl` is not bad: it doesn't require too many weird dependencies, and in particular no Java issues. Those used to be a problem in the old days, before MS opened up the documentation of its strange file formats. I will try, but I think allowing Excel is now a relatively small fix, probably 5-10 lines or so, and `readxl` is part of the tidyverse, so not a weird dependency to add at this point.
What do you think? @RobFryer, @swamap? If I can make Excel light enough to be low on pain, would that be a useful addition? Note that a major benefit of Excel is that encodings are not an issue any more -- it's a binary format.
Did a quick test with @swamap's AMAP stations file, and lifting it into Excel, all seems to work as before. I would be very happy leaving it in, for now, although maybe being clear that XLS/XLSX is experimental right now.
Note that there is native code to this extension, but no Java. It was a quick install.
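The "5-10 lines or so" could plausibly look something like the sketch below — a hypothetical dispatch-on-extension helper, not the actual HARSAT change:

```r
# Sketch only: pick a reader based on the file extension, assuming readxl
# is installed. Function name and read.table defaults are illustrative.
read_input_file <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext %in% c("xls", "xlsx")) {
    # readxl returns a tibble; convert so downstream data.frame logic works.
    # Encodings are a non-issue here, since Excel files are binary.
    as.data.frame(readxl::read_excel(path))
  } else {
    read.table(path, header = TRUE, sep = "\t", quote = "\"",
               stringsAsFactors = FALSE)
  }
}
```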
We need to improve support for input files, because users are facing some challenges relating to input data.
`get_RECO` appears to implement some most unusual logic for ICES data: it reads in `PARAM` files as lines, and then does its own comma splitting, so it can fix up the `Description` field if it contains unquoted commas. This is poor form on the part of ICES -- if it is the case.

We also use `read_excel` in a few places in `add_non_ICES_data` -- no good reason for this that I can see, although I still think we might be better allowing folks to use Excel files, so long as we don't require any dependencies to do that in HARSAT proper. In particular, `readxl` should really not be a blocker for anyone, since it is a common part of the tidyverse. (I heard the discussion of Java, but `readxl` claims no external dependencies -- I suspect this might be a legacy issue from Open Office/LibreOffice era integrations.)

We also mix `readr::read_csv` and `read.csv`, but `readr::read_csv` does not support encodings, so `read_csv` and `read_delim` should not be used, at all.

So, to concrete proposals:
- Allow `stations` and `contaminants` to be passed to `read_data` either as filenames or as data frames -- this enables users to intervene and handle formats differently.
- Add options to `read_data` to customize this behaviour.
- Consider `fread()`, and allow any of the options somehow to get injected as appropriate.
- Read once with `fread()` and then check column names as part of a possibly deeper validation process -- there doesn't seem to be any benefit to the current two-pass approach.
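The first proposal could be sketched as follows — `resolve_input` is a hypothetical helper name, not an existing HARSAT function:

```r
# Sketch: accept either a filename or an already-read data frame, so
# users can pre-process awkward formats themselves before calling
# read_data. The helper name and reader default are assumptions.
resolve_input <- function(x, reader = utils::read.table, ...) {
  if (is.data.frame(x)) {
    x                      # caller already read (and fixed up) the data
  } else if (is.character(x) && length(x) == 1) {
    reader(x, ...)         # treat as a filename and read it ourselves
  } else {
    stop("expected a filename or a data frame")
  }
}
```

With something like this, `read_data(stations = my_fixed_df, ...)` and `read_data(stations = "stations.txt", ...)` would both work, and the reader itself becomes a customization point.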