pelson / pyudunits2

A pure Python library designed for handling units of physical quantities, fully based on the UDUNITS2 grammar and XML database
Apache License 2.0
8 stars 1 forks source link

Figure out what to do with dates #4

Open pelson opened 4 weeks ago

pelson commented 4 weeks ago

Currently the grammar doesn't support dates, and all of the functionality around that would need to be implemented.

We need to make a decision on what to do with this - should we out-source it to cftime, or integrate it so that pyudunits2 is a one-stop-shop for unit handling.

Note this also came up in https://github.com/ioos/compliance-checker/pull/1118/files#diff-e7a4b63dd33717c4e657b158dab5b6b9db5178a5ed9dc608f118dcbc25dec494R328.

cc @ocefpaf for your thoughts.

ocefpaf commented 4 weeks ago

I'm inclined to let cftime handle that.

ocefpaf commented 3 weeks ago

A few comments to share my experience when trying to use cftime as an alternative. Note that I may be doing something wrong!

Workaround 1: I could not find a way to do only "units" as defined by COARDS definition, only full dates conversion that requires a calendar, cf_units could do that, no calendar needed. I'm also using a private method from cftime, not ideal.

xref.: https://github.com/ioos/compliance-checker/pull/1118/files#diff-4af926901344d50518fc4386b31673c7bfa2a64a0752088963b2b1596ec32be8R2048

Workaround 2: We used is_long_time_interval() from cf-units to check for "bad" months/years since in the units. That method is deprecated in cf-units, so we need to update that check regardless.

xref.: https://github.com/ioos/compliance-checker/pull/1118/files#diff-88d3dfc7498015aa900cc3050d5e5a9004adbf8d5b3d732185f90bd97de75862L1868

Workaround 3: We make use of is_convertible_to and is_time_reference from cf-units. That makes me wonder if we should wrap cftime in pyudunits2 for a smoother user experience when moving from cf-units.

Workaround 4: cftime is strict about 60 secs, cf-units was not! In a way, that is wrong and we should fix out test.

xref.: https://github.com/ioos/compliance-checker/pull/1118/files#diff-bbc2203122ce6acdb68da4bf8c0493966a5532a17f80421ff805c447106fd325R1301

Hope I'm giving you something useful and not just noise 😬

pelson commented 3 weeks ago

I'm in the process of updating the pyudunits2 grammar to support calendar units properly. It is clear that in udunits2 this was implemented after the grammar was defined, since it is quite a long way from reality. Note that I had tests in place knowing that there was work to do for the example of timezone https://github.com/pelson/pyudunits2/blob/f75f1afd27d4b66ef77fc9db9301ae914111d288/pyudunits2/tests/_grammar/test_parse.py#L246. Past me also warned current me about taking this implementation on... https://github.com/pelson/pyudunits2/blob/f75f1afd27d4b66ef77fc9db9301ae914111d288/pyudunits2/_expr_graph.py#L150. Past me wasn't wrong about it being gnarly :joy: .

In udunits2 the following specification is seemingly valid (but wild): hours since 2000-01-01 23:59:59.60.2. If you parse this with cftime it looks like it gets transformed incorrectly:

>>> dt = cftime._dateparse('hours since 2000-01-01 23:59:59.60.2', calendar='standard')         
>>> dt.isoformat()
'2000-01-01T23:59:59.600000'

https://github.com/ioos/compliance-checker/pull/1118/files#diff-bbc2203122ce6acdb68da4bf8c0493966a5532a17f80421ff805c447106fd325L1301

I see in the CF spec this is covered explicitly:

A date/time in the excluded range must not be used as a reference date/time e.g. seconds since 2016-12-31 23:59:60 is not a permitted value for units

I'm thinking that in pyudunits2 we should either:

Having thought about this a little, I think I will make a date-time-timezone object, and stop there. It will be sufficient to know that you have a date-like unit, and then easy enough to build a datetime from it subsequently.

pelson commented 3 weeks ago

So over the last week I put in a large number of hours working on proper date parsing in pyudunits2. Unfortunately, it required deep re-factoring of the parser and I didn't manage to successfully reach the goal (and found many udunits2 bugs en-route). Frustratingly, I scrapped all that work, and went back to basics - I have now added support for parsing dates with timezones (the original issue that came up), as well as adding the ability to know if a unit is "time-like". From there, it should be easy for me to add the ability to determine if a unit is date-time based. Furthermore, it should be easy enough to expose the raw string that was parsed in that date specification, so that we can hand it over to tools like cftime. Some thought on how to expose that in an interface is still needed.

In short: you should now be able to at least parse and recognise units with calendars/date-time info. It may be slightly problematic for you at this moment though, since now it parses but there is no easy way to determine if the thing is convertible (and I have no doubt that there will be some deep errors when you try to do things like convert date-time units to something else).

I think I may go back and pick out some of the work I did in my failed attempt, so that we can get some coherent representation of a date-reference instead of repeating it back verbatim.

ocefpaf commented 3 weeks ago

I have no doubt that there will be some deep errors when you try to do things like convert date-time units to something else

In compliance-checker we are abusing the .is_convertible() method to compare units. Maybe converting dates-units should not be in the scope here to make it simple? Right now we would get an inconsistent behavior in compliance checker:

cf_units.Unit("seconds since 1980-01-19").is_convertible("hours since 1980-01-19")
True

cf_units.Unit("seconds since 1980-01-19").is_convertible("hours")
False

ut_system.unit("seconds since 1980-01-19").is_convertible_to(ut_system.unit("hours since 1980-01-19"))
True

ut_system.unit("seconds since 1980-01-19").is_convertible_to(ut_system.unit("hours"))
True

TL;DR IMO we should fix compliance-checker to check if the units are time references and not just "time."


Edit:

Maybe we need to differentiate from time and time reference like cf_units does:

u = cf_units.Unit("seconds since 1980-01-19")
u.is_time()
False  # This is True in pyudunits2

u.is_time_reference()
True  # This doesn't exist in pyudunits2
ocefpaf commented 3 weeks ago

Just hit a corner case when parsing iso dates. Let me know if I should open new issues, hold off for now, or report these here.

# Passes
cf_units.Unit("days since 1970-01-01T00:00:00 UTC")
Unit('days since 1970-01-01T00:00:00', calendar='standard')

# Fails
ut_system.unit("days since 1970-01-01T00:00:00 UTC")

  File inline:1
    'days since 1970-01-01T00:00:00 UTC'
                             ^
SyntaxError: mismatched input ':' expecting <EOF>
pelson commented 3 weeks ago

Just hit a corner case when parsing iso dates. Let me know if I should open new issues, hold off for now, or report these here.

A comment is fine for now. If you find it is backing up, open an issue (and feel free to clump multiple issues into one if helpful)

I just fixed the iso issue in #6

pelson commented 2 weeks ago

Note to self: Date units look a lot like reference points, but it is a mistake to use units for this purpose.

To give an example, you can't use a date as a reference for a non time unit:

meters since 2000-01-01

(i.e. this isn't a measure of the distance moved since a certain date)

However:

seconds since 2000-01-01

Feels a lot like a date serialisation format... yet principally the two are equivalent in what they represent (the number of units since the given date).

Because we know how many time units there are in a date (if we know the calendar), we can shift the date reference and apply it to the unit value:

$ udunits2 -H 'hours since 2000-01-01' -W 'minutes since 2000-01-01'
    1 hours since 2000-01-01 = 60 (minutes since 2000-01-01)
    x/(minutes since 2000-01-01) = 60*(x/(hours since 2000-01-01))

Yet the equally convertible form for meters is not supported:

$ udunits2 -H 'meters since 2000-01-01' -W 'kilometers since 2000-01-01'
udunits2: Don't recognize "meters since 2000-01-01"

Clearly, when it comes to understanding calendars we have to be able to associate the unit with the date:

$ udunits2 -H 'hours since 2000-01-01' -W 'hours since 2000-01-02'
    1 hours since 2000-01-01 = -23 (hours since 2000-01-02)
    x/(hours since 2000-01-02) = (x/(hours since 2000-01-01)) - 24

But if you accidentally forget that this date isn't really a "reference point", you can easily get unexpected results. Imagine you had been measuring the growth of a plant from a particular date, and naively tried to shift the reference point (a nonsensical question without more information [just like date shifting without a calendar is nonsensical]):

$ udunits2 -H 'mm / (day since 2000-01-01)' -W 'mm / (day since 2000-01-02)'
    1 mm / (day since 2000-01-01) = 1 (mm / (day since 2000-01-02))
    x/(mm / (day since 2000-01-02)) = (x/(mm / (day since 2000-01-01)))

Woops... your growth rate is date invariant. This should really have been represented as a unit of mm / day with the reference point being metadata that belongs outside of the unit. For this reason, I'm planning to prohibit date units other than those which are declared at the top level (e.g. days since 2000-01-01). Anything like ms per day since 2000-01-01 will be rejected (as will meters since 2000-01-01). There will be a special date unit class to represent this, as the operations permitted are not the same as those on a normal unit.

Note that there is an ambiguity in days since 20000101 which will be preserved, and may continue to produce nonsensical results:

$ udunits2 -H 'mm / day since 20000101' -W 'mm/ day since 20000201'
    1 mm / day since 20000101 = -99 (mm/ day since 20000201)
    x/(mm/ day since 20000201) = (x/(mm / day since 20000101)) - 100

In the future, I may try to prohibit this by introducing a unit concept of relative scale or delta units. I think that is a bigger topic that I won't try to fix here though.

ocefpaf commented 2 weeks ago

Note to self: Date units look a lot like reference points, but it is a mistake to use units for this purpose.

OK. Point taken 😄

Not that, IMO, it is a mistake to differentiate those two time formats in that manner. I'm old and I'm from a time where "time-units since some-date" was just called COARDS time units, and treated as time units like the rest. I'll change the logic in compliance-checker to not rely on any is_time_reference or similar. I agree with you that you should not implement that in pyudunits2.

pelson commented 2 weeks ago

I agree with you that you should not implement that in pyudunits2.

Hold your horses :smile:. I have a DateUnit in the pipeline for precisely the purpose you suggested. Just note that you can't multiply DateUnit with another unit, for example. Should have more detail this week.

ocefpaf commented 2 weeks ago

Hold your horses 😄. I have a DateUnit in the pipeline for precisely the purpose you suggested. Just note that you can't multiply DateUnit with another unit, for example. Should have more detail this week.

You are always 10 steps ahead of me.

BTW, I believe we should be good for a first release. Even if you want to do an alpha one. I'd like to get more people suing pyudunits2 sooner rather than later. It has an uphill battle to de-throne years of udunits2 work-around code.

pelson commented 2 weeks ago

I have now exposed a DateUnit and the ability to access the reference_date on that type. For now, I can only expose the raw value, but plan to turn the type into an interpreted DateTime in the next few days.

>>> import pyudunits2
>>> us = pyudunits2.UnitSystem.from_udunits2_xml()
>>> u = us.unit('s since 2000')
>>> type(u)
<class 'pyudunits2.DateUnit'>
>>> u.reference_date.raw_content
'2000'
>>> print(u.unit)
s

BTW, I believe we should be good for a first release

I'd like to bed down this date implementation (for example, until I wrote the above, you couldn't access unit on the DateUnit), then we can proceed with an alpha release. Let's aim for the beginning of next week.