python / cpython

The Python programming language
https://www.python.org

os.stat(): add new fields to get timestamps as Decimal objects with nanosecond resolution #55666

Closed 19e35e4c-f8b0-41ed-9bf1-d6341832c748 closed 12 years ago

19e35e4c-f8b0-41ed-9bf1-d6341832c748 commented 13 years ago
BPO 11457
Nosy @loewis, @rhettinger, @jcea, @mdickinson, @abalkin, @gustaebel, @vstinner, @larryhastings, @bitdancer, @skrah
Dependencies
  • bpo-11941: Support st_atim, st_mtim and st_ctim attributes in os.stat_result

Files
  • larry.decimal.utime.patch.1.txt: First revision
  • time_integer.patch
  • time_decimal.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = 'https://github.com/rhettinger'
    closed_at =
    created_at =
    labels = ['type-feature', 'library']
    title = 'os.stat(): add new fields to get timestamps as Decimal objects with nanosecond resolution'
    updated_at =
    user = 'https://bugs.python.org/khenriksson'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'larry'
    assignee = 'rhettinger'
    closed = True
    closed_date =
    closer = 'larry'
    components = ['Library (Lib)']
    creation =
    creator = 'khenriksson'
    dependencies = ['11941']
    files = ['23246', '24309', '24321']
    hgrepos = []
    issue_num = 11457
    keywords = ['patch']
    message_count = 65.0
    messages = ['130478', '130479', '130596', '134642', '137558', '137578', '137580', '137593', '137599', '137600', '137606', '137608', '137877', '137888', '138978', '138979', '138980', '138984', '138987', '139169', '139321', '143573', '143644', '143738', '143739', '143801', '143802', '143803', '143805', '143807', '143811', '143812', '143819', '143820', '143837', '143866', '143867', '143868', '143873', '143881', '143885', '143898', '144543', '144607', '145256', '145262', '145288', '151872', '151873', '151912', '151943', '151987', '151992', '152003', '152004', '152305', '152306', '152314', '152317', '152320', '152322', '152323', '152350', '152355', '154405']
    nosy_count = 16.0
    nosy_names = ['loewis', 'rhettinger', 'jcea', 'mark.dickinson', 'belopolsky', 'lars.gustaebel', 'vstinner', 'larry', 'nadeem.vawda', 'Arfrever', 'r.david.murray', 'skrah', 'Alexander.Belopolsky', 'rosslagerwall', 'khenriksson', 'ericography']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue11457'
    versions = ['Python 3.3']
    ```

    19e35e4c-f8b0-41ed-9bf1-d6341832c748 commented 13 years ago

    The most recent (issue 7) release of the POSIX standard mandates support for nanosecond precision in certain system calls. For example, the stat structure includes a timespec struct for each of mtime, atime, and ctime that provides such nanosecond precision.[1] There is also a futimens() call that allows setting the time accurate to the nanosecond.[2] Support for such precision is available on Linux 2.6 kernels, at least.

    1. http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
    2. http://pubs.opengroup.org/onlinepubs/9699919799/functions/futimens.html

    Currently, the Python float type is used everywhere to express times in a single value (such as the result from os.stat). However, since this is implemented in CPython using type double (and possibly similarly elsewhere) it is impossible to obtain sufficient precision using a float.
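A quick illustration (mine, not from the original report) of the precision loss: a 2011-era timestamp with nine significant sub-second digits does not survive storage in a C double.

```python
from decimal import Decimal

# A March-2011 timestamp with full nanosecond resolution.
sec, nsec = 1_300_000_000, 123_456_789
exact = Decimal(sec) + Decimal(nsec).scaleb(-9)   # exact value, in seconds

as_float = sec + nsec / 10**9            # what a float-valued st_mtime would hold
error = abs(Decimal(as_float) - exact)   # Decimal(float) converts exactly

print(error > 0)    # True: the double differs from the real timestamp
print(error)        # on the order of 1e-7 s, i.e. roughly 100 ns lost
```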

    Therefore, it would be useful to expose the number of seconds and nanoseconds separately to allow full precision. Perhaps adding these values as additional members to the return value from os.stat would be most useful, something like .st_atimensec that Linux sometimes uses, or else just follow the POSIX standard to include a sub-struct.

    This is important for example with the tarfile module with the pax tar format. The POSIX tar standard[3] mandates storing the mtime in the extended header (if it is not an integer) with as much precision as is available in the underlying file system, and likewise to restore this time properly upon extraction. Currently this is not possible.

    3. http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
    19e35e4c-f8b0-41ed-9bf1-d6341832c748 commented 13 years ago

    Also, a new function similar to os.utime would be needed as well, perhaps something named like os.utimens. This would be needed to allow setting times with nanosecond precision.
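(Editorial note: what eventually landed in Python 3.3, after this issue was closed, covers exactly this need, but with integer nanoseconds rather than a new function name: os.stat() grew st_*_ns fields and os.utime() an ns= parameter. A sketch using that real, current API:)

```python
import os
import tempfile

# Python 3.3+ exposes timestamps as plain integer nanoseconds (st_*_ns),
# and os.utime() accepts ns=(atime_ns, mtime_ns) -- the role the
# os.utimens() proposed here would have played.
with tempfile.NamedTemporaryFile() as f:
    target = (1_300_000_000_000_000_001, 1_300_000_000_123_456_789)
    os.utime(f.name, ns=target)
    st = os.stat(f.name)
    print(st.st_atime_ns, st.st_mtime_ns)  # full precision, if the fs stores it
```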

    abalkin commented 13 years ago

    See also bpo-10812 which implements os.futimens().

    a04be92c-af4e-4c3d-ab01-017f3a697ce8 commented 13 years ago

    Closed bpo-11941 as a duplicate of this.

    bitdancer commented 13 years ago

    The mailbox module would benefit from having this precision available.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    I suggest that rather than using composite time stamps, decimal.Decimal is used to represent high-precision time in Python.

    On input to os.utime, the function could just polymorphically accept Decimal, and try its best.

    I see three approaches that preserve compatibility for stat (plus the incompatible one of just changing the field types of struct stat):

    1. have a flag in the stat module to change the field types globally. This would be appropriate if the ultimate goal is to eventually change the fields in an incompatible way in Python 4.
    2. have a flag to stat that changes the field types, on a per-call basis
    3. mirror the existing fields, into _decimal versions.
    80036ac5-bb84-4d39-8416-02cd8e51707d commented 13 years ago

    os.utimensat() and os.futimens() already exist since Python 3.3 and require 2-tuples (or None) as second and third argument.

    (utime() is deprecated since POSIX 2008: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/utime.h.html)

    (Changes specific to os.stat() are discussed in issue bpo-11941.)

    abalkin commented 13 years ago

    On Fri, Jun 3, 2011 at 3:57 PM, Martin v. Löwis \report@bugs.python.org\ wrote: ..

    I suggest that rather than using composite time stamps, decimal.Decimal is used to represent high-precision time in Python.

    I support this idea in theory, but as long as decimal is implemented in Python, os module should probably expose a low level (tuple-based?) interface and a higher level module would provide Decimal-based high-precision time.

    BTW, what is the status of cdecimal?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    Am 03.06.2011 22:11, schrieb Arfrever Frehtes Taifersar Arahesis:

    Arfrever Frehtes Taifersar Arahesis \Arfrever.FTA@GMail.Com\ added the comment:

    os.utimensat() and os.futimens() already exist since Python 3.3 and require 2-tuples (or None) as second and third argument.

    "Already since 3.3" means "they don't exist yet". I.e. it isn't too late to change them.

    (utime() is deprecated since POSIX 2008: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/utime.h.html)

    This is a case where I think Python shouldn't follow POSIX deprecation. In C, you need to change the function name to change the parameter types; not so in Python.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    I support this idea in theory, but as long as decimal is implemented in Python, os module should probably expose a low level (tuple-based?) interface and a higher level module would provide Decimal-based high-precision time.

    Can you explain why you think so? I fail to see the connection.

    abalkin commented 13 years ago

    On Fri, Jun 3, 2011 at 6:13 PM, Martin v. Löwis \report@bugs.python.org\ wrote: ..

    > I support this idea in theory, but as long as decimal is implemented
    > in Python, os module should probably expose a low level (tuple-based?)
    > interface and a higher level module would provide Decimal-based
    > high-precision time.

    Can you explain why you think so? I fail to see the connection.

    One reason is the desire to avoid loading a Python module from a C module. I understand that this ship has already left the port with larger and larger portions of the stdlib being implemented in Python, but doing that in a basic module such as os (or rather posix) is likely to cause more problems than what we have in other similar situations. For example, strptime is implemented in a Python module loaded by time and datetime, which are implemented in C. This works, but at the cost of extreme trickery in the test suite, and sophisticated applications encounter similar problems. As far as I remember, some multi-threading issues have never been resolved.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    One reason is the desire to avoid loading Python module from a C-module.

    This desire is indeed no guidance for Python development; the opposite is the case. The only exception may be bootstrapping issues, which I claim are irrelevant in this case.

    abalkin commented 13 years ago

    On Fri, Jun 3, 2011 at 6:52 PM, Martin v. Löwis \report@bugs.python.org\ wrote: ..

    > One reason is the desire to avoid loading Python module from a
    > C-module.

    This desire is indeed no guidance for Python development; the opposite is the case.

    Can you elaborate on this? I did notice the current trend of mixing software layers and welcoming circular dependencies in the Python stdlib, but I am not sure this is a good thing. In the good old times, imports inside functions were frowned upon (and for many good reasons). Imports from inside C functions seem to be even worse. Tricks like this greatly reduce the understandability of the code. The import statements at the top of a module tell a great deal about what the module can and cannot do. When modules can be imported at will as a side effect of innocuous-looking functions (time.strptime is my personal pet peeve), analysis of programs becomes much more difficult.

    The only exception may be bootstrapping issues, which I claim are irrelevant in this case.

    It is hard to tell without attempting an implementation, but my intuition is exactly the opposite. I believe parts of the import mechanism have been implemented in Python and it seems to me that os.stat() may need to be available before decimal can be imported.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    Can you elaborate on this?

    Not on the tracker; this is appropriate on python-dev.

    vstinner commented 13 years ago

    I suggest that rather than using composite time stamps, decimal.Decimal is used to represent high-precision time in Python.

    Hey, why has nobody proposed datetime.datetime objects? Can't we improve the datetime precision to support nanoseconds? I would prefer to have a nice datetime object instead of an integer with an "unknown" reference (epoch). Or does it cost too much (cpu/memory) to create "temporary" datetime objects when the user just wants to check the file mode?

    Well, the typical use case of a file timestamp is to check if a file has been modified (mtime greater than the previous value), or if a file is newer than another (mtimeA > mtimeB). I don't think that formatting the timestamp is the most common usage of os.stat() & friends. float, int tuples and Decimal are all comparable types.

    For timestamp arguments (e.g. signal.sigtimedwait, bpo-12303), I would like to be able to pass a tuple (int, int) *or a float*. It is not because the function provides high precision that I need high precision. I bet that most users only need second resolution for signal.sigtimedwait, for example.

    If you want to pass Decimal: why not, as you want :-) But we have to write a shared function to parse timestamps with a nanosecond resolution (to always accept the same types).
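A sketch of the kind of shared helper Victor describes; the name `parse_timestamp` and its exact contract are my invention, not anything from the thread:

```python
from decimal import Decimal

def parse_timestamp(value):
    """Normalize a timestamp to an (int seconds, int nanoseconds) pair.

    Accepts float, int, Decimal, or a (sec, nsec) tuple -- the types
    discussed in the thread. Hypothetical helper, for illustration only.
    """
    if isinstance(value, tuple):
        sec, nsec = value
        return int(sec), int(nsec)
    if isinstance(value, Decimal):
        sec = int(value // 1)
        nsec = int(((value - sec) * 10**9).to_integral_value())
        return sec, nsec
    # int or float: split off the fractional part and round to whole ns
    sec = int(value // 1)
    nsec = round((value - sec) * 1e9)
    return sec, int(nsec)

print(parse_timestamp(1.5))                     # (1, 500000000)
print(parse_timestamp((5, 250)))                # (5, 250)
print(parse_timestamp(Decimal("2.000000001")))  # (2, 1)
```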

    By the way, Windows also uses timestamps with a nanosecond resolution; it's not specific to POSIX! Oh! And Python has an os.stat_float_times(False) function to change the behaviour of the stat functions globally! It recalls other bad ideas like datetime.accept2dyear, sys.setfilesystemencoding() or sys.setdefaultencoding(). I don't like functions that change the behaviour of Python globally!

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    Hey, why nobody proposed datetime.datetime objects?

    datetime.datetime is extremely bad at representing time stamps. Don't use broken-down time if you can avoid it.

    By the way, Windows does also use timestamps with a nanosecond resolution, it's not specific to POSIX!

    Actually, it doesn't. The Windows filetime data type uses units of 100ns, starting on 1.1.1601.
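For reference, a sketch (not from the thread) of converting a Windows FILETIME (100 ns ticks since 1601-01-01) to the Unix epoch; 11644473600 s is the standard gap between the two epochs:

```python
# FILETIME counts 100-nanosecond intervals since 1601-01-01 00:00 UTC.
EPOCH_DELTA_S = 11_644_473_600          # seconds from 1601-01-01 to 1970-01-01

def filetime_to_unix_ns(filetime: int) -> int:
    """Convert a FILETIME tick count to integer Unix nanoseconds."""
    return filetime * 100 - EPOCH_DELTA_S * 10**9

# The Unix epoch itself, expressed as a FILETIME tick count:
print(filetime_to_unix_ns(116_444_736_000_000_000))   # 0
```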

    vstinner commented 13 years ago

    datetime.datetime is extremely bad at representing time stamps. Don't use broken-down time if you can avoid it.

    I didn't know that datetime is "extremely bad at representing time stamps", could you explain please?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    I didn't know that datetime is "extremely bad at representing time stamps", could you explain please?

    vstinner commented 13 years ago

    there is no easy way to convert it into "seconds since the epoch"

    Ah yes, that reminds me that Alexander rejected my .totimestamp() patch (bpo-2736) because he considers that "Converting datetime values to float is easy":

    (dt - datetime(1970, 1, 1)) / timedelta(seconds=1)

    I still think that this formula is *not* trivial, and must be a builtin method. For example, the formula is different if your datetime object is an aware instance:

    (dt - datetime(1970, 1, 1, tzinfo=timezone.utc)) / timedelta(seconds=1)

    When do you need to convert file timestamps to epoch? If we use datetime in os.stat output, we should also accept it as input (e.g. for os.utime).
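For what it's worth, the two formulas quoted above can be checked directly (a sketch; timedelta division requires Python 3.2+):

```python
from datetime import datetime, timedelta, timezone

# Naive datetime interpreted as UTC -> seconds since the epoch:
dt = datetime(2011, 6, 26, 12, 0, 0)
naive_ts = (dt - datetime(1970, 1, 1)) / timedelta(seconds=1)

# The same instant as an aware datetime needs the aware epoch:
aware = dt.replace(tzinfo=timezone.utc)
aware_ts = (aware - datetime(1970, 1, 1, tzinfo=timezone.utc)) / timedelta(seconds=1)

print(naive_ts)                # 1309089600.0
print(naive_ts == aware_ts)    # True: both formulas agree for UTC input
```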

    any broken-down time has issues of time stamp ordering in the duplicate hour of switching from DST to normal time

    I understand that it is an issue of the datetime module. Can it be solved, or is there a design issue in the module?

    time zone support is flaky-to-nonexistent in the datetime module

    Python 3.3 has now a builtin implementation of fixed timezones, but yes, there are still things to be improved (e.g. support timezone names like "CET").

    --

    I don't have a strong opinion on this issue, I just wanted to know why datetime cannot be used for this issue.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    > any broken-down time has issues of time stamp ordering in the
    > duplicate hour of switching from DST to normal time

    I understand that it is an issue of the datetime module. Can it be solved, or is there a design issue in the module?

    It's an inherent flaw of broken-down time. Don't use that representation; the only true representation of point-in-time is "seconds since the epoch, as a real number" (IMO, of course). Broken-down time has the advantage of being more easily human-readable, but is (often deliberately) incomplete (with the notion of partial time stamps) and text representations are difficult to parse.

    I don't have a strong opinion on this issue, I just wanted to know why datetime cannot be used for this issue.

    It's a personal preference of mine (the strong objection to broken-down time representations). I believe this preference is widely shared, though. Notice how advanced file systems (NTFS, ext2) use seconds-since-the-epoch formats, whereas FAT uses broken-down time. Also notice how the daytime protocol uses broken-down time, and NTP uses seconds-since-the-epoch. The major variation point in the latter is whether second fractions are represented as a separate number or not; this is also the issue here. NTP and NTFS use a single number; ext2 uses seconds/nanoseconds. Also notice that NTP does *not* have a unit that is an integral power of ten, but units of 2**-32 s (ca. 233 ps). NTP4 supports a resolution of 2**-64 s. (To be fair, the way NTP represents time stamps can also be interpreted as a pair of second/subsecond integers.)

    abalkin commented 13 years ago

    On Sun, Jun 26, 2011 at 8:23 AM, Martin v. Löwis \report@bugs.python.org\ wrote: ..

    > I understand that it is an issue of the datetime module. Can it be
    > solved, or is there a design issue in the module?

    It's an inherent flaw of broken-down time. Don't use that representation;

    Not quite. This is an inherent flaw of expressing time in time zones with DST adjustments. Yet even if there were no DST, using local time for file timestamps is inconvenient because you cannot easily compare timestamps across systems. This is similar to using a locale encoding instead of Unicode. However, this flaw does not affect timestamps expressed in UTC. UTC is sort of the Unicode (or maybe UTF-8) of timezones.

    the only true representation of point-in-time is "seconds since the epoch, as a real number" (IMO, of course).

    Mathematically speaking, a broken-down UTC timestamp is equivalent to "seconds since the epoch, as a real number". There are relatively simple mathematical formulas (defined by POSIX) that convert from one representation to the other and back. As long as the "real number" and the broken-down structure contain the sub-second data to the same precision, the two representations are mathematically equivalent. In practice one representation may be more convenient than the other. (This is somewhat similar to decimal vs. binary representation of real numbers.) When performance is an issue, "real numbers" may be preferable to broken-down structures, but in most applications datetime/timedelta objects are easier to deal with than abstract numbers.

    Broken-down time has the advantage of being more easily human-readable, but is (often deliberately) incomplete (with the notion of partial time stamps) and text representations are difficult to parse.

    I am not sure I understand this. ISO timestamps are not more difficult to parse than decimal numbers. I don't think Python supports partial timestamps and certainly partial timestamps would not be appropriate for representing os.stat() fields.
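The equivalence Alexander describes (broken-down UTC time vs. seconds since the epoch) can be checked with the stdlib, since calendar.timegm is the inverse of time.gmtime for whole seconds (a sketch, not from the thread):

```python
import calendar
import time

t = 1_309_089_600                          # an arbitrary whole-second epoch value
broken_down = time.gmtime(t)               # the broken-down UTC form (struct_time)
round_trip = calendar.timegm(broken_down)  # POSIX formula back to epoch seconds

print(round_trip == t)   # True: the two representations carry the same data
```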

    larryhastings commented 13 years ago

    This is probably a terrible idea, but: what about using a subclass of float that internally preserves the original sec / usec values? Call it a utime_float for now. os.stat would produce them, and os.utime would look for them--and if it found one it'd pull out the precise numbers.

    Type promotion as a result of binary operators:

    utime_float OP int = utime_float
    utime_float OP float = degrades to float

    I suspect code rarely does math on atime/utime/ctime and then writes out the result. Most of the time it simply propagates the utime values around, comparing them to each other, or setting them unchanged.

    For those rare occasions when someone wants to change the fractional part of a utime_float, we could provide a function utime_fractional(int) -> utime_float.

    Advantages:

    Disadvantages:
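A minimal sketch of the subclass Larry describes (the names `utime_float` and `exact` are hypothetical; no such type exists in the stdlib):

```python
class utime_float(float):
    """A float that remembers the exact (sec, nsec) pair it came from."""

    def __new__(cls, sec, nsec):
        self = super().__new__(cls, sec + nsec / 10**9)
        self._sec, self._nsec = sec, nsec
        return self

    @property
    def exact(self):
        return (self._sec, self._nsec)

t = utime_float(1_300_000_000, 123_456_789)
print(t > 1_299_999_999)   # True: behaves as an ordinary float in comparisons
print(t.exact)             # (1300000000, 123456789): full precision retained
print(type(t + 1))         # plain float: binary operators degrade, as described
```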

    larryhastings commented 13 years ago

    Here's a better idea: we add a new IEEE 754-2008 quad-precision float type. The IEEE 754-2008 quad precision float has 1 sign bit, 15 bits of exponent, and 112 bits of mantissa, so it should have enough precision to last utime until humanity transforms itself into a single angelic being of pure light and energy.

    GCC has had __float128 since 4.3, Clang has __float128 now too, Intel's compiler has _Quad. It looks like Visual C++ doesn't support it yet--it does support a funny 80-bit float but I don't think Python wants to go there.

    I realize a new float type would be a major undertaking, but it seems to me that that's really the right way to do it. Nobody would have to change their code, and it'd behave like the existing float. It'd be just like 2.x, with "int" and "long"!

    vstinner commented 13 years ago

    timespec is just a structure of two integers, so we should expose it as a simple and efficient Python tuple: (int, int). We can simply expose this type in os.stat, or we can do better by providing an optional callback to convert this tuple to a high-level object. It looks like everybody wants something different at the high level (decimal, datetime, float128, ...), so giving the caller the choice of type looks like a good idea :-)

    os.stat(fn) => timestamps stored as int
    os.stat(fn, lambda x: x) => timestamps stored as (int, int)

    Callbacks for other data types:

    import datetime
    import decimal

    def to_decimal(sec, nsec):
        return decimal.Decimal(sec) + decimal.Decimal(nsec).scaleb(-9)

    def to_datetime(sec, nsec):
        # naive, we can do better (the float conversion loses precision)
        t = sec + nsec * 1e-9
        return datetime.datetime.fromtimestamp(t)

    def to_float128(sec, nsec):
        # float128 is hypothetical here; no such builtin type exists
        return float128(sec) + float128(nsec) * float128(1e-9)

    etc.

    Using a callback also removes the bootstrap issue: we don't have to provide to_datetime() in the posix module or directly in the os module. The datetime module may provide its own callback, or we can write it as a recipe in the os.stat documentation.

    I don't know how to call this new argument: decode_timestamp? timestamp_callback? ...?

    If it is too slow to use a callback, we can take the first option: expose the timestamp as (int, int). For example: os.stat(path, tuple_timestamp=True).

    80036ac5-bb84-4d39-8416-02cd8e51707d commented 13 years ago

    I suggest having a low-level, POSIX-compatible, (int, int)-based interface in the os module and adding a high-level, decimal.Decimal-based interface in the shutil module.

    larryhastings commented 13 years ago

    I think a pair of integers is a poor API. It ties the value of the fractional part to nanoseconds. What happens when a future filesystem implements picosecond resolution? And then later goes to femtoseconds? Or some platform chooses another divisor (2**32)? This should all be abstracted away by the API, as the current API does. Otherwise you start sprinkling magic values in your code (e.g. 1e9). Suggesting that other representations (float128, Decimal) can be built on top of the (int, int) interface is irrelevant; obviously, any representation can be built on top of any other.

    I think Decimal is reasonable, except that it breaks existing code. One cannot perform computation between a Decimal and a float, so almost any existing manipulation of atime/utime would start throwing exceptions.
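Larry's compatibility point in action (my demonstration, not from the thread): Decimal deliberately refuses implicit arithmetic with float, though comparisons do work:

```python
from decimal import Decimal

mtime = Decimal("1300000000.123456789")   # a hypothetical Decimal st_mtime
try:
    mtime + 0.5        # e.g. existing code adding a float slop factor
except TypeError as exc:
    print("TypeError:", exc)   # Decimal + float raises TypeError

# Comparisons between Decimal and float, on the other hand, are supported:
print(mtime > 0.5)     # True
```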

    I suggest that a float128 type would solve the problem on all counts--nobody would have to change their code, and it would handle nanosecond (or I think even zeptosecond!) resolution. And when we run out of resolution, we can switch to float256. (Or an arbitrary-precision float if Python has one by then.)

    os.stat added support for float atime/mtime in 2.3, specifically in October 2002: http://hg.python.org/cpython/rev/0bbea4dcd9be This predates both the inclusion of Decimal in Python (2.4) and nanosecond resolution in the utime API (2008). I could find no discussion of the change, so I don't know what other representations were considered. It's hard to say what the author of the code might have done if Decimal had existed back then, or if he foresaw nanosecond resolution.

    However, that author is already present in this thread ;-) Martin?

    5579dc13-9f48-42d1-bb17-9c003ef6fa70 commented 13 years ago

    On Fri, Sep 9, 2011 at 4:50 PM, Larry Hastings \report@bugs.python.org\ wrote: ..

    I think a pair of integers is a poor API.  It ties the value of the fractional part to nanoseconds.  What happens when a future filesystem implements picosecond resolution?

    If history repeats, struct stat will grow new st_xtimesuperspec fields, st_xtimespec will become a macro expanding to st_xtimesuperspec.tv_picosec, and we will get a request to support that in os.stat(). I don't see why this conflicts with stat_result.st_xtimespec returning a (sec, nsec) tuple. If we ever have to support higher resolution, stat_result will grow another member with a (sec, picosec) or whatever value is appropriate.

     And then later goes to femtoseconds?

    Same thing.

     Or some platform chooses another divisor (2**32)?

    Unlikely, but C API will dictate Python API if this happens.

    5579dc13-9f48-42d1-bb17-9c003ef6fa70 commented 13 years ago

    On Fri, Sep 9, 2011 at 5:22 PM, Alexander Belopolsky \report@bugs.python.org\ wrote: ..

    If history repeats, struct stat will grow new st_xtimesuperspec fields, st_xtimespec will become a macro expanding to st_xtimesuperspec.tv_picosec

    On second thought, this won't work. To support higher resolution, we will need to supply 3 fields in st_xtimesuperspec: tv_sec and tv_nsec packed in st_xtimespec (say tv_timespec) and a new tv_psec field. st_xtime will now be st_xtimesuperspec.tv_timespec.tv_sec and st_xtimespec will be a new macro expanding to st_xtimesuperspec.tv_timespec. The rest of my argument still holds.

    larryhastings commented 13 years ago

    To support higher resolution, we will need to supply 3 fields in st_xtimesuperspec: tv_sec and tv_nsec packed in st_xtimespec (say tv_timespec) and a new tv_psec field. st_xtime will now be st_xtimesuperspec.tv_timespec.tv_sec and st_xtimespec will be a new macro expanding to st_xtimesuperspec.tv_timespec.

    How is this superior to using either Decimal or float128?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    This predates both the inclusion of Decimal in Python (2.4) and nanosecond resolution in the utime API (2008). I could find no discussion of the change, so I don't know what other representations were considered. It's hard to say what the author of the code might have done if Decimal had existed back then, or if he foresaw nanosecond resolution.

    However, that author is already present in this thread ;-) Martin?

    I think I specifically expected that nanosecond resolution in the file system API will not be relevant ever, since a nanosecond is damned short. I also specifically wanted to support units of 100ns, since that is what NTFS used at that time (and still uses).

    I also considered that introducing float would cause backwards incompatibilities, and provided the stat.float_times setting, and made only the indexed fields return ints, whereas the named fields contained floats. I think I would have chosen an arbitrary-precision fractional type had one been available. If a two-ints representation is considered necessary, I'd favor a rational number (numerator, denominator) over a pair (second, subsecond); this would also support 2**-32 fractions (as used in NTP !!!).
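Martin's rational-number alternative can be sketched with the stdlib fractions module (my illustration, not his), which represents both nanosecond and 2**-32 fractions exactly:

```python
from fractions import Fraction

# A POSIX-style (sec, nsec) timestamp as an exact rational number:
posix_ts = Fraction(1_300_000_000) + Fraction(123_456_789, 10**9)

# An NTP-style timestamp (seconds plus a 2**-32 fraction), equally exact:
ntp_ts = Fraction(1_300_000_000) + Fraction(1, 2**32)

print(posix_ts.numerator, posix_ts.denominator)   # 1300000000123456789 1000000000
print(ntp_ts - 1_300_000_000)                     # 1/4294967296, no precision lost
```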

    As yet another approach, I propose a set of st_[cma]timensec fields which always give an int representing the integral part of the time stamp in nanoseconds since the epoch. If sub-nanosecond time stamps ever become a reality, st_[cma]time_asec fields could be added, for attosecond resolution.

    larryhastings commented 13 years ago

    I've drawn an ASCII table summarizing the proposals so far. If I've made any factual errors I trust you'll let me know.

    =<type> means os.stat().st_mtime is changed to that type. +<type> means os.stat() grows a new field using that type, and the current behavior of st_mtime is unchanged.


    [ UPSIDES ]                          =(int,int)  +(int,int)  =Decimal  +Decimal  =float128
    [ yes is good! ]

    all existing code gets
    more accurate for free                   no          no          no        no       yes

    some existing code gets
    more accurate for free                   no          no          yes       no       yes

    guaranteed future-proof against
    new representations                      no          no          yes       yes      no*

    very very future-proof against
    new representations                      no          no          yes       yes      yes*

    [ DOWNSIDES ]                        =(int,int)  +(int,int)  =Decimal  +Decimal  =float128
    [ yes is bad! ]

    breaks existing code                     yes         no          yes       no       no

    requires new code in order to
    take advantage of added precision        yes         yes         yes       yes      no

    requires implementing a
    complicated new type                     no          no          no        no       yes

    My take on the above: if we're willing to put people through the pain of changing their code to use the new accuracy, then Decimal is the obvious winner. I see no advantage to any of the pair-of-floats proposals over Decimal.

    If we want all existing code to continue working and get more accurate automatically, the only viable option is float128 (or a multiple-precision float).

    larryhastings commented 13 years ago

    s/pair-of-floats/pair-of-ints/

    Also, to be clear: yocto is the smallest defined SI prefix. And what I meant when I referred to 10**-3 was, float128 could handle 10**-24 but not 10**-27. According to my back-of-the-envelope calculations, float128 could accurately represent timestamps with yoctosecond resolution for another 650 years to come.

    5579dc13-9f48-42d1-bb17-9c003ef6fa70 commented 13 years ago

    On Fri, Sep 9, 2011 at 5:42 PM, Larry Hastings \report@bugs.python.org\ wrote: ..

    How is this superior to using either Decimal or float128?

    It is explicit about the units of time used. If we use named tuples and retain C API field names, stat_result.tv_atimespec.tv_sec will clearly mean number of seconds and stat_result.tv_atimespec.tv_nsec will clearly mean nanoseconds. Even if we use plain tuples, the convention will be obvious to someone familiar with C API. And familiarity with C API is expected from users of os module, IMO. Those who need higher level abstractions should use higher level modules.

    5579dc13-9f48-42d1-bb17-9c003ef6fa70 commented 13 years ago

    On Fri, Sep 9, 2011 at 6:18 PM, Larry Hastings \report@bugs.python.org\ wrote: ..

    I've drawn an ASCII table summarizing the proposals so far.

    You did not mention "closely matches C API" as an upside.

    mdickinson commented 13 years ago

    [about adding float128]

    I realize a new float type would be a major undertaking

    That's an understatement and a half. The only way this could ever be viable is if float128 support becomes widespread enough that we don't have to write our own algorithms for basic float128 operations; even then, it would still be a massive undertaking. MPFR provides these operations, but it's LGPL.

    I don't see this happening in the foreseeable future.

    larryhastings commented 13 years ago

    Mark Dickinson:

    > I realize a new float type would be a major undertaking

    That's an understatement and a half. The only way this could ever be viable is if float128 support becomes widespread enough that we don't have to write our own algorithms for basic float128 operations

    As I mentioned earlier in this thread, GCC has supported __float128 since 4.3, Clang added support within the last year, and Intel has a _Quad type. All are purported to be IEEE 754-2008 quad-precision floats. Glibc added "quadmath.h" recently (maybe in 4.6), which defines sinq() / tanq() / etc. Is that not sufficient?

    The Windows compilers don't seem to support a float128 yet. But Windows only supports 100ns resolution for mtime/atime anyway. The bad news: according to my back-of-the-envelope calculations, doubles will stop accurately representing 100ns timestamps in less than a year; they'll lose another bit of precision somewhere around 2019.
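The back-of-the-envelope claim is easy to check with math.ulp (Python 3.9+, so not available to the thread's participants): the spacing between adjacent doubles near a 2011 epoch value already exceeds 100 ns.

```python
import math

# Spacing between adjacent doubles at a September-2011 epoch timestamp:
t = 1_315_000_000.0
print(math.ulp(t))   # about 2.4e-07 s: adjacent doubles are ~240 ns apart,
                     # so 100 ns (1e-07 s) steps can no longer be represented

# The spacing doubles at each power of two, so precision keeps degrading:
print(math.ulp(2.0**31))   # about 4.8e-07 s
```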

    Alexander Belopolsky

    > How is this superior to using either Decimal or float128?

    It is explicit about the units of time used.

    Why is that a feature? I'd rather that was abstracted away for me, like the os module currently does.

    And familiarity with C API is expected from users of os module, IMO.

    As you say, that's your opinion. But I've never heard of that as an official policy. Therefore it is irrelevant as a design constraint for the API.

    > I've drawn an ASCII table summarizing the proposals so far.

    You did not mention "closely matches C API" as an upside.

    By "C API", you mean "the POSIX-2008 compatible C API". This would not match the C API on platforms that do not provide those interfaces, such as Windows.

    The documentation for the os module states: "This module provides a portable way of using operating system dependent functionality. [...] The design of all built-in operating system dependent modules of Python is such that as long as the same functionality is available, it uses the same interface"

    Since "the most recent modification time / access time of a file" is available on all platforms Python supports, it follows that Python should use the same interface to represent it on all those platforms. Tying the representation to that of one particular platform is therefore poor API design, particularly when there are representations that abstract away such details within easy reach.

    So I don't consider it a compelling upside--in fact I consider it a disadvantage.

    vstinner commented 13 years ago

    "As I mentioned earlier in this thread, GCC has supported __float128 since 4.3, Clang added support within the last year, and Intel has a _Quad type. All are purported to be IEEE 754-2008 quad-precision floats. Glibc added "quadmath.h" recently (maybe in 4.6), which defines sinq() / tanq() / etc. Is that not sufficient?"

    Python is compiled using Visual Studio 2008 on Windows. Portability does matter on Python. If a type is not available on *all* platforms (including some old platforms, e.g. FreeBSD 6 or Windows XP), we cannot use it by default.

    larryhastings commented 13 years ago

    Victor STINNER:

    Python is compiled using Visual Studio 2008 on Windows. Portability does matter on Python. If a type is not available on *all* platforms (including some old platforms, e.g. FreeBSD 6 or Windows XP), we cannot use it by default.

    The quad-precision float would be highly portable, but not 100% guaranteed. The idea is that st_mtime would be a float128 on a recent Linux, but still a double on Windows. Python's "duck typing", combined with a judicious implementation of float128, would permit code that performed simple math on a timestamp to run unchanged. That should be sufficient for almost all existing code that deals with these timestamps.

    But there are obvious, plausible scenarios where this would break. For example: pickling a float128 mtime on one platform and attempting to unpickle it on Windows. I can imagine legitimate reasons why one would want to ship ctime/atime/mtime across platforms.

    If that's an unacceptable level of portability, then for the proposal to remain viable we'd have to implement our own portable quad-precision floating point type. That's a staggering amount of work, and I doubt it would happen. But short of some quad-precision type, there's no way to preserve existing code and have it get more precise automatically.

    If float128 isn't viable then the best remaining option is Decimal. But changing st_mtime to Decimal would be an even more violent change than changing it to float was. I propose adding the Decimal fields "ctime", "atime", and "mtime" to the named tuple returned by os.stat().
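    To illustrate why Decimal is attractive here, a sketch with a made-up (seconds, nanoseconds) pair (not output from the actual patch): a double cannot hold a current POSIX timestamp to full nanosecond precision, while Decimal can.

    ```python
    from decimal import Decimal

    # Hypothetical (seconds, nanoseconds) pair as stat() might report it.
    sec, nsec = 1300000000, 123456789

    # A double has only ~15-16 significant decimal digits, so the
    # nanoseconds get rounded on the way in.
    as_float = sec + nsec * 1e-9

    # Decimal keeps every digit exactly.
    as_decimal = Decimal(sec) + Decimal(nsec) / 10**9

    print(as_decimal)                       # 1300000000.123456789
    print(Decimal(as_float) == as_decimal)  # False: the float lost precision
    ```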

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    The quad-precision float would be highly portable

    Larry, please stop this discussion in this issue. I believe a PEP would be needed, and would likely be rejected because of the very very very long list of issues that can't be resolved. I think you seriously underestimate the problems. Please trust Mark on this.

    For example, gcc doesn't support __float128 in 32-bit mode on x86.

    If float128 isn't viable then the best remaining option is Decimal. But changing st_mtime to Decimal would be an even more violent change than changing it to float was. I propose adding the Decimal fields "ctime", "atime", and "mtime" to the named tuple returned by os.stat().

    That sounds reasonable to me. While we are at it, I'd rename "ctime" to "creationtime" on Windows, to prevent people from believing it is "ctime" (i.e. inode change time).

    vstinner commented 13 years ago

    If a two-ints representation is considered necessary, I'd favor a rational number (numerator, denominator) over a pair (second, subsecond); this would also support 2**-32 fractions (as used in NTP !!!).

    Which OS uses NTP timestamps in stat()? Or are you thinking about other functions?

    As yet another approach, I propose a set of st_[cma]time_nsec fields which always give an int representing the integral part of the time stamp in nanoseconds since the epoch.

    As I wrote before, I would prefer to keep the same number of fields in the Python structure and in the C structure, but I don't have a strong opinion on this choice. If we want to stay close to the C API, we can use a namedtuple:

    s = os.stat(filename, time_struct=True)
    ctime = s.ctime.tv_sec + s.ctime.tv_nsec * 1e-9

    Or maybe:

    s = os.stat(filename, time_struct=True)
    ctime = s.ctime.sec + s.ctime.nsec * 1e-9

    A namedtuple is not a good idea if we want to support other time resolutions, because some developer may write "s.ctime[0] + s.ctime[1] * 1e-9" without taking care of the time resolution.

    Because Windows uses a resolution of 100 ns and POSIX uses 1 ns, I still don't see why we should support something else. If we use the following API, we can still add other resolutions later (using a new argument):

    s = os.stat(filename, nanoseconds=True)
    sec, nsec = s.ctime
    ctime = sec + nsec * 1e-9

    What should be done if the OS only has a resolution of 1 sec? Raise an exception, or use nsec=0? Same question if we add *_nsec fields: should these fields be optional, or always present?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    As I wrote before, I would prefer to keep the same number of fields in the Python structure and in the C structure, but I don't have a strong opinion on this choice.

    I'm with Larry - exposing time fields as structured records is hostile to the programmer. It is a true pain in C to do any meaningful computation on timeval or timespec values. It may be a little more convenient in Python, but we should really attempt to expose the time stamps as single fields that support arithmetic and string conversion, else people will hate us for 500 years (or when we next need to revise this struct).

    Nothing is *gained* by exposing structured time. People may see it as an advantage that this closely matches the POSIX spec, but all it does in reality is to confront people with platform differences for no true gain.

    s = os.stat(filename, nanoseconds=True)
    sec, nsec = s.ctime
    ctime = sec + nsec * 1e-9

    What should be done if the OS only has a resolution of 1 sec? Raise an exception, or use nsec=0? Same question if we add *_nsec fields: these fields are optional, or always present?

    If we declare that stat_result has nanosecond resolution, then it should have that even on systems that only support second resolution (or 2-second resolution, like FAT).
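    That normalization could look like this (a hypothetical helper, not from any patch in this thread): whatever resolution the filesystem actually stores, the result always carries a nanosecond field, padded with zeros when necessary.

    ```python
    def to_timespec(seconds):
        """Return (sec, nsec) for an int or float timestamp."""
        sec = int(seconds // 1)
        nsec = round((seconds - sec) * 10**9)
        return sec, nsec

    print(to_timespec(4))    # (4, 0) -- e.g. a FAT filesystem, 2 s resolution
    print(to_timespec(4.5))  # (4, 500000000)
    ```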

    mdickinson commented 13 years ago

    I propose adding the Decimal fields "ctime", "atime", and "mtime" to the named tuple returned by os.stat().

    That would be an interesting precedent: I don't think there are many (any?) other places outside the 'decimal' module that deal with Decimal objects (excluding general purpose serialization protocols and the like).

    Decimal's support for conversion to-and-from float (and comparison with float) is stronger than it used to be; I think this could work.

    larryhastings commented 13 years ago

    Mark Dickinson wrote:

    I think this could work.

    "could"? Oh ye of little faith!

    Attached is a patch against a nice fresh trunk (2b47f0146639) that adds Decimal attributes "ctime", "mtime", and "atime" to the object returned by os.stat(). The functions that consume mtime and atime values (os.utime, os.futimes, os.lutimes, os.futimesat) now also accept Decimal objects for those values. My smoke-test using os.stat and os.utime works fine, and CPython passes the unit test suite with only the expected skippage. However, the patch isn't ready to be checked in; I didn't update the documentation, for one. I'd be happy to post the patch on Rietveld--just ask.

    The implementation was complicated by the fact that Decimal is pure Python ;) I had to do some "build the ship in a bottle" work. Also, it's possible for os.stat to be called before the Python interpreter is really ready. So internally it has to gracefully handle an import error on "decimal".

    Along the way I noticed some minor resource leaks in os.utime.

    I also cleared up some silliness. When built on non-Windows, extract_time etc. used nine places of precision; on Windows it only used six. Windows only calls extract_time for os.utime--the other functions that use extract_time aren't available on Windows. And in the Windows implementation of os.utime, it multiplied the time it got from extract_time by a thousand! This might even be throwing away some precision--not sure. Definitely it was cruft.

    However, modifying this meant changing the Windows code, which I can't test! So I'm not 100% certain it's right.

    Finally the bad news: this patch introduces a major performance regression in os.stat. On my laptop, timeit says os.stat takes 10x longer when building the three Decimal fields. My immediate thought: lazy-create them. This would mean some major brain surgery; I'd have to make a subtype of PyStructSequence and override... something (tp_getattr? tp_getattro?). (Though this might also neatly ameliorate the early-startup import problem above.) I'd also have to hide the exact integers in the object somewhere--but since I'd be subclassing anyway this'd be no big deal.
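    The lazy-creation idea can be sketched in pure Python (all names here are hypothetical; the real patch would do this in C on a PyStructSequence subtype):

    ```python
    from decimal import Decimal

    class LazyStat:
        """Sketch: build the Decimal timestamp only on first access, so
        callers of os.stat() who never touch it pay no construction cost."""

        def __init__(self, mtime_sec, mtime_nsec):
            # Hide the exact integers in the object, as described above.
            self._sec = mtime_sec
            self._nsec = mtime_nsec

        @property
        def mtime(self):
            try:
                return self._mtime  # cached after the first access
            except AttributeError:
                self._mtime = Decimal(self._sec) + Decimal(self._nsec) / 10**9
                return self._mtime

    s = LazyStat(1300000000, 500000000)
    print(s.mtime)  # 1300000000.5
    ```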

    My second thought: maybe one of the other Decimal constructors is faster? I'm currently using the "parse a string" form. My guess is, one of the other forms might be faster but not by an order of magnitude.
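    A quick way to compare the two construction strategies (a micro-benchmark sketch; the timings will vary by machine and by whether decimal is the pure-Python or C implementation):

    ```python
    import timeit
    from decimal import Decimal

    sec, nsec = 1300000000, 123456789

    # "Parse a string" form vs. pure integer arithmetic.
    from_string = lambda: Decimal("%d.%09d" % (sec, nsec))
    from_ints = lambda: Decimal(sec) + Decimal(nsec) / 10**9

    # Both must produce the same value before timing them.
    assert from_string() == from_ints()

    print("string:", timeit.timeit(from_string, number=10000))
    print("ints:  ", timeit.timeit(from_ints, number=10000))
    ```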

    Martin van Löwis wrote:

    For example, gcc doesn't support __float128 in 32-bit mode on x86.

    That was only true for GCC 4.3. GCC 4.4 and newer support __float128 in 32- and 64-bit modes on Intel. That release has been out for more than two years.

    But consider the matter dropped ;-)

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 13 years ago

    BTW, what is the status of cdecimal?

    I just wrote the same in another issue, but not everyone is subscribed to that:

    I think cdecimal is finished and production ready. The version in

    http://hg.python.org/features/cdecimal#py3k-cdecimal

    is the same as what will be released as cdecimal-2.3 in a couple of weeks. cdecimal-2.3 has a monumental test suite against *both* decimal.py and decNumber. The test suite no longer finds any kind of (unknown) divergence between decimal.py, cdecimal and decNumber.

    Tests for cdecimal-2.3 have been running on 6 cores for more than half a year.

    In short, review would be highly welcome. ;)

    larryhastings commented 13 years ago

    Can I get some thoughts / votes on whether to a) check in with the current performance regression, or b) do the work to make it lazy-created?

    rhettinger commented 13 years ago

    [Arfrever Frehtes Taifersar Arahesis]

    I suggest to have low-level, POSIX-compatible, (int, int)-based interface in os module and add high-level, decimal.Decimal-based interface in shutil module.

    I agree that this is the cleanest approach. Ideally, the os module stays as close as possible to the underlying structures. Also, it is desirable to keep it fast (not importing a pure Python decimal module as a side effect of checking a timestamp -- making everyone pay the cost for a feature that few people will want or need).

    With respect to the options suggested by MvL, I support adding new named fields and am opposed to using a flag to indicate a type change (that would be error-prone).

    If new fields are added, their names need to follow the existing naming convention (st_variable).

    -1 on the patch as currently proposed. I don't think the performance impact is acceptable.

    rhettinger commented 13 years ago

    One other thought: it would be useful to research how nanosecond-resolution timestamps are going to be supported in other languages.

    vstinner commented 12 years ago

    Attached patch prepares time.wallclock() to be able to return the result as an integer (seconds, nanoseconds).

    vstinner commented 12 years ago

    With the new functions time.wallclock() and time.clock_gettime() (issue bpo-10278), and time.monotonic() which may also be added (issue bpo-13846), I now agree that it is important to support t2-t1 to compute a difference. Using a tuple, it's not easy to compute a difference.

    time.wallclock(), time.clock_gettime() and time.monotonic() have a nanosecond resolution on Linux. Using issue bpo-13845, time.time() will have a resolution of 100 ns on Windows.

    80036ac5-bb84-4d39-8416-02cd8e51707d commented 12 years ago

    st_atim, st_ctim and st_mtim attributes could be instances of a class (implemented in posixmodule.c) similar to:

    class timespec(tuple):
        def __init__(self, arg):
            if not isinstance(arg, tuple):
                raise TypeError
            if len(arg) != 2:
                raise TypeError
            if not isinstance(arg[0], int):
                raise TypeError
            if not isinstance(arg[1], int):
                raise TypeError
            self.sec = arg[0]
            self.nsec = arg[1]
            tuple.__init__(self)
        def __add__(self, other):
            if not isinstance(other, timespec):
                raise TypeError
            ns_sum = (self.sec * 10 ** 9 + self.nsec) + (other.sec * 10 ** 9 + other.nsec)
            return timespec(divmod(ns_sum, 10 ** 9))
        def __sub__(self, other):
            if not isinstance(other, timespec):
                raise TypeError
            ns_diff = (self.sec * 10 ** 9 + self.nsec) - (other.sec * 10 ** 9 + other.nsec)
            return timespec(divmod(ns_diff, 10 ** 9))