osparcomm / HARSAT

Harmonized Regional Seas Assessment Tool
https://osparcomm.github.io/HARSAT/

Improve handling of data (and information) files, relating to encoding, delimiters #350

Closed: morungos closed this issue 8 months ago

morungos commented 8 months ago

We need to improve support for input files, because users are facing some challenges relating to input data.

  1. ICES data tends to arrive tab-delimited with UTF-8 encoding. Excel can read such files, but can no longer write them (only old versions could). In any event, we really need to allow tab-delimited files to be read, both with and without UTF-8 encoding.
  2. CSV files are much easier for Excel to manage, and again, may be read and written with or without UTF-8 encoding.
  3. get_RECO appears to implement some highly unusual logic for ICES data: it reads PARAM files in as lines and then does its own comma splitting, so that it can fix up the Description field if it contains unquoted commas. That is poor form on the part of ICES -- if it is indeed what is happening.
  4. We could always improve file validation, to reduce the risks of analysis proceeding in invalid ways.
  5. We are still using read_excel in a few places in add_non_ICES_data -- no good reason for this that I can see, although I still think we might be better off allowing folks to use Excel files, so long as we don't require any dependencies to do that in HARSAT proper. In particular, readxl should really not be a blocker for anyone, since it is a common part of the tidyverse. (I heard the discussion of Java, but readxl claims no external dependencies -- I suspect this is a legacy concern from the Open Office/LibreOffice era of integrations.)
  6. We are using both readr::read_csv and read.csv, but readr::read_csv does not support file encodings directly (only via its locale argument).
  7. File extensions are not massively reliable, so we should probably allow options for delimiter, quotes, etc., to be passed in explicitly -- these can be inferred from extensions as a fallback.
  8. The worst case is that we encounter different formats for different files, e.g., tab-delimited ASCII for contaminants and UTF-8 CSV for stations. Unfortunately, that probably isn't all that improbable.
  9. In all cases, UTF-8 is a superset of ASCII, so if we always read UTF-8, we are safe -- put another way, it is almost always wrong to assume data is not encoded. In theory, data can be encoded as something other than UTF-8, but that is a problem for someone else and another day. This basically means that read_csv and read_delim should not be used at all.
  10. Note that different information files may be written in different formats, in exceptional (and maybe not so exceptional) circumstances.
  11. Also, we still need to add checksums. See #349

So, on to concrete proposals:

  1. Allow stations and contaminants to be passed to read_data as filenames or as data frames -- this enables users to intervene and handle formats differently.
  2. Essentially, we need to route all file reading through a single underlying function, which can read anything and which allows some additional configuration to control delimiters, etc. This will very much be an "advanced users only" thing, but may be required. By and large, though, a safe default is ".txt" -> read as tab-delimited UTF-8, and ".csv" -> read as comma-delimited/quoted UTF-8, because any ASCII file can be read safely as UTF-8 anyway. Then, we can allow a reader function to be passed to read_data to customize this behaviour (see the sketch after this list).
  3. Underneath, we should replace all the file access we can with fread(), and allow its options to be injected as appropriate.
  4. Replace the current two-pass pattern (read a file's header, check it, then read its body) with a single fread() call followed by checking column names as part of a possibly deeper validation process. There doesn't seem to be any benefit to the two-pass approach.
  5. This is a major internal reorganization, and especially will require testing against ICES data, but should address all of the above.
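
For proposal 2, something like this is what I have in mind. A minimal sketch only: read_any and its argument defaults are hypothetical, not a final API.

```r
# Minimal sketch of a single underlying reader (read_any is a hypothetical name).
# Defaults: ".txt" -> tab-delimited, ".csv" -> comma-delimited/quoted; everything
# is read as UTF-8, which is also safe for plain ASCII files.
read_any <- function(file, sep = NULL, quote = "\"", encoding = "UTF-8", ...) {
  if (is.null(sep)) {
    ext <- tolower(tools::file_ext(file))
    sep <- switch(ext, csv = ",", txt = "\t", "auto")
  }
  data.table::fread(file, sep = sep, quote = quote, encoding = encoding,
                    data.table = FALSE, ...)
}
```

read_data could then take something like reader = read_any as a default argument, so advanced users can drop in their own function with the same signature.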
morungos commented 8 months ago

@RobFryer Can you review this and see what you think?

The bit that fills me with dread is the code in get_RECO that implies ICES (or something) is generating invalid CSV. But I cannot trace any calls to this, from anywhere, so maybe that is not our problem anyway.
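
For the record, here is a purely hypothetical illustration of the failure get_RECO seems designed to repair -- the values are invented, and I haven't confirmed that ICES actually emits lines like this:

```r
# An unquoted comma inside the Description field breaks naive splitting:
line <- "CD,Cadmium, total,ug/kg"           # 3 logical fields, 4 after splitting
parts <- strsplit(line, ",", fixed = TRUE)[[1]]
# One possible repair: keep the first and last fields, re-join the middle.
fixed <- c(parts[1],
           paste(parts[2:(length(parts) - 1)], collapse = ","),
           parts[length(parts)])
fixed                                       # "CD" "Cadmium, total" "ug/kg"
```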

morungos commented 8 months ago

Some notes:

RobFryer commented 8 months ago

One thing I don't want to lose is the facility to read in each column with a specified class (e.g. character, logical, numeric, integer). I presume fread does this.

If fread produces a data.table rather than a plain data.frame, we can just convert it immediately afterwards, so that shouldn't be an issue
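
Something like this is what I have in mind (the file and column names are invented):

```r
# fread with per-column classes; data.table = FALSE returns a plain data.frame.
stations <- data.table::fread(
  "stations.csv",
  colClasses = c(station_code = "character", latitude = "numeric",
                 longitude = "numeric", active = "logical"),
  data.table = FALSE
)
```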

morungos commented 8 months ago

I agree. The main reason for the changes I've made is that data.table and fread were proving much harder to work with: row names are gone, and a lot of our logic depends on them. As it is, tibbles "just work". I have code that works fine now, but quoting is more important than I thought. Any time I've used tab files before, there has been no quoting; here, some of the files have it, but not consistently. So we do need to support it.

Anyway, I have managed to switch all file reading to use read.table with various parameters, so hopefully today I'll get something that does the encoding checks too, and get started on additional validations.
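
This is roughly the shape of the read.table calls I've ended up with -- the argument values here are illustrative, not the exact code:

```r
# Tab-delimited UTF-8 with optional quoting; "UTF-8-BOM" also reads plain
# ASCII and strips a byte-order mark if one is present.
contaminants <- read.table(
  "contaminants.txt",              # hypothetical file name
  header = TRUE,
  sep = "\t",                      # "," for CSV files
  quote = "\"",                    # some files are quoted, some are not
  fileEncoding = "UTF-8-BOM",
  stringsAsFactors = FALSE,
  comment.char = ""
)
```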

morungos commented 8 months ago

Also, I would not exclude the possibility of allowing Excel. readxl is not bad: it doesn't require any weird dependencies, and in particular no Java. That used to be the case in the old days, before MS opened up the documentation of its strange file formats. I will try, but I think allowing Excel is now a relatively small fix, probably 5-10 lines or so, and readxl is part of the tidyverse, so not a weird dependency to add at this point.

What do you think? @RobFryer, @swamap? If I can make Excel light enough to be low on pain, would that be a useful addition? Note that a major benefit of Excel is that encodings are not an issue any more -- it's a binary format.
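
To give a sense of scale, the whole thing could be as small as this (a sketch only, with an invented file name):

```r
# Minimal sketch of Excel support via readxl; read_excel returns a tibble,
# so convert to a plain data.frame for the rest of the pipeline.
library(readxl)
stations <- read_excel("stations.xlsx", sheet = 1)
stations <- as.data.frame(stations)
```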

morungos commented 8 months ago

Did a quick test with @swamap's AMAP stations file: lifting it into Excel, everything seems to work as before. I would be very happy leaving this in for now, although we should maybe be clear that XLS/XLSX support is experimental right now.

Note that this extension does involve some native code, but no Java. It was a quick install.