pacificclimate / ncdf4.helpers

Routines to make using NetCDF files in R cleaner and easier.

Parsing of time units fails with multiple white spaces #8

Closed lochbika closed 2 years ago

lochbika commented 2 years ago

Hi,

version: 0.3-6

I just found out that parsing of the time unit string in the nc.get.time.series function fails if the string contains multiple consecutive spaces. For instance:

days since 850-1-1 00:00:00

is parsed perfectly fine. However

days since           850-1-1 00:00:00

throws an error:

Error in as.POSIXlt.character(x, tz, format, ...) : 
  character string is not in a standard unambiguous format

The source of the error is the two consecutive spaces between "since" and "850-1-1". I guess strsplit returns empty elements in the resulting vector time.split.

A possible workaround would be to first run time.split <- strsplit(f$dim$time$units, " ")[[1]] and then drop the empty elements with something like time.split <- time.split[time.split != ""].
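For illustration, a minimal sketch of that workaround, assuming f is an ncdf4 file handle with a time dimension (the file name here is made up):

library(ncdf4)

# Hypothetical file whose units string is e.g. "days since   850-1-1 00:00:00"
f <- nc_open("example_cmip5.nc")
time.units <- f$dim$time$units

# Split on single spaces, then drop the empty strings that consecutive
# spaces produce
time.split <- strsplit(time.units, " ")[[1]]
time.split <- time.split[time.split != ""]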

Cheers, Kai

P.S. great work! I love the comments in the source code ;)

jameshiebert commented 2 years ago

Hey Kai,

Hmmm, that's definitely an unusual case. Not sure why someone would have arbitrarily long runs of spaces in their time units. That case isn't explicitly covered by the CF Conventions spec.

Admittedly, the canonical implementation of units parsing, udunits2, does parse your case:

> library(udunits2)
udunits system database read
> ud.is.parseable('days since 850-1-1 00:00:00')
[1] TRUE
> ud.is.parseable('days since   850-1-1 00:00:00')
[1] TRUE

And since strsplit takes a regular expression, it's easy to split on one or more spaces with a one-character change:

> strsplit('days since   850-1-1 00:00:00', ' +')
[[1]]
[1] "days"     "since"    "850-1-1"  "00:00:00"

I'll push a patch, but it's unlikely that it will get incorporated into a release in the near future. Thanks for the report, though.

lochbika commented 2 years ago

Hi James, thanks for the quick reply. The solution with regex is great!

Hmmm, that's definitely an unusual case. Not sure why someone would have arbitrarily long runs of spaces in their time units. That case isn't explicitly covered by the CF Conventions spec.

I also wouldn't expect it, but it happened while I was reading some CMIP5 output into R. It was just a double space, which is apparently enough to trigger the error.

Thanks for the fix.