irods / irods

Open Source Data Management Software
https://irods.org
BSD 3-Clause "New" or "Revised" License

document iquest attrs #6660

Open mcast opened 2 years ago

mcast commented 2 years ago

Feature

To the extent that this is a feature request,

  1. I would like clear & discoverable documentation explaining what (at least some of) the attributes listed by iquest attrs | grep TIME mean.
  2. I would like metadata recording the time of the iput, and for that to be preserved across (most?) other operations.

I thought I had that so I'm going to continue as for a bug -

OR

Bug Report

iRODS Version, OS and Version

Ubuntu 18.04.6 LTS (bionic), iRODS Version 4.2.7

What did you try to do?

I tried to use DATA_CREATE_TIME as a time indicating when the file was iput.

In between the iput and my question, there have been multiple irepl / itrim operations.

This shell script illustrates the issue:

#! /bin/sh

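# Run a command, printing a timestamp before and after it.
stamped_command() {
    echo "$( date +%s.%N ): $*"
    "$@"
    date +%s.%N
    echo

    # Space the timestamps out so they are clearly distinguishable.
    sleep 10
}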

stamped_command  sleep 2

stamped_command  iput world.txt
stamped_command  irepl -R red world.txt

iquest "select DATA_CREATE_TIME, DATA_MODIFY_TIME, DATA_RESC_HIER, COLL_NAME, DATA_NAME where DATA_NAME = 'world.txt' "

stamped_command  irepl -R purple world.txt

# move to a different green
stamped_command  itrim -N2 -S green world.txt
stamped_command  irepl     -R green world.txt

# move to a different red
stamped_command  itrim -N2 -S red world.txt
stamped_command  irepl     -R red world.txt

iquest "select DATA_CREATE_TIME, DATA_MODIFY_TIME, DATA_RESC_HIER, COLL_NAME, DATA_NAME where DATA_NAME = 'world.txt' "

The sleeps are just to space the timestamps out enough to be clearly distinguishable.

Expected behavior

I expected one of the times shown in the first iquest output to continue to be available in the second iquest.

If there is such a time value stored, please can it be documented more clearly?

Observed behavior (including steps to reproduce, if applicable)

$ echo hello > world.txt
$ ~/irods-times.sh
1666707107.227801886: sleep 2
1666707109.230709353

1666707119.233119925: iput world.txt
1666707119.678681099

1666707129.680878383: irepl -R red world.txt
1666707130.398442783

DATA_CREATE_TIME = 01666707119
DATA_MODIFY_TIME = 01666707119
DATA_RESC_HIER = green;greenrandom;green1;irods-cgp-sr15-dev-sdb
COLL_NAME = /cgp_dev/home/mca2
DATA_NAME = world.txt
------------------------------------------------------------
DATA_CREATE_TIME = 01666707130
DATA_MODIFY_TIME = 01666707130
DATA_RESC_HIER = red;redrandom;red1;irods-cgp-dev-e02-sdc
COLL_NAME = /cgp_dev/home/mca2
DATA_NAME = world.txt
------------------------------------------------------------
1666707140.550315786: irepl -R purple world.txt
1666707140.974399906

1666707150.976392599: itrim -N2 -S green world.txt
Total size trimmed = 0.000 MB. Number of files trimmed = 1.
1666707151.438445844

1666707161.440624781: irepl -R green world.txt
1666707161.895416456

1666707171.897411887: itrim -N2 -S red world.txt
Total size trimmed = 0.000 MB. Number of files trimmed = 1.
1666707172.322684986

1666707182.325646001: irepl -R red world.txt
1666707183.010352472

DATA_CREATE_TIME = 01666707140
DATA_MODIFY_TIME = 01666707140
DATA_RESC_HIER = purple;purplerandom;purple1;irods-cgp-sr15-dev-sda
COLL_NAME = /cgp_dev/home/mca2
DATA_NAME = world.txt
------------------------------------------------------------
DATA_CREATE_TIME = 01666707161
DATA_MODIFY_TIME = 01666707161
DATA_RESC_HIER = green;greenrandom;green1;irods-cgp-sr15-dev-sde
COLL_NAME = /cgp_dev/home/mca2
DATA_NAME = world.txt
------------------------------------------------------------
DATA_CREATE_TIME = 01666707182
DATA_MODIFY_TIME = 01666707182
DATA_RESC_HIER = red;redrandom;red1;irods-cgp-dev-e02-sdb
COLL_NAME = /cgp_dev/home/mca2
DATA_NAME = world.txt
------------------------------------------------------------
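
To make the comparison between the two listings concrete, here is a minimal sketch that pulls the earliest DATA_CREATE_TIME out of iquest output. The helper name earliest_create_time is hypothetical, and the sketch assumes the default "ATTR = value" output format shown above.

```shell
#! /bin/sh

# Hypothetical helper: read iquest output on stdin and print the earliest
# DATA_CREATE_TIME among the replicas listed.
earliest_create_time() {
    awk '$1 == "DATA_CREATE_TIME" { print $3 }' | sort -n | head -n 1
}

# Values copied from the first listing above (green and red replicas).
earliest_create_time <<'EOF'
DATA_CREATE_TIME = 01666707119
DATA_CREATE_TIME = 01666707130
EOF

# Values copied from the second listing (purple, green and red replicas).
earliest_create_time <<'EOF'
DATA_CREATE_TIME = 01666707140
DATA_CREATE_TIME = 01666707161
DATA_CREATE_TIME = 01666707182
EOF
```

The first call prints 01666707119, the time of the iput. The second prints 01666707140, so no replica still carries the original value once the first green replica has been trimmed. Against a live zone, the iquest command from the script could be piped straight into the helper.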

Complications

I didn't get into imv or anything which might modify the contents of the file.

mcast commented 2 years ago

(@kript: sorry I didn't get this in for the meeting last week. The local interrupt generator is not maskable.)

trel commented 2 years ago

Data objects do not have timestamps.... replicas have timestamps.

Does that shift your mental model enough that a solution to your needs/assumptions presents itself?

You trimmed the original replica, so the original timestamp is now gone. You could always attach an AVU (at the data object level) with markings/timings of your choosing.
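
One way the AVU suggestion could look in practice is sketched below. This is an illustration rather than an established recipe: the wrapper name iput_stamped and the attribute name upload_time are assumptions, and the sketch only handles a file uploaded into the current iRODS working collection under its own base name.

```shell
#! /bin/sh

# Hypothetical wrapper: upload a file into the current iRODS collection, then
# record the upload time as an AVU on the data object (not on a replica).
# The attribute name "upload_time" is an assumption made for this sketch.
iput_stamped() {
    file=$1
    now=$( date +%s )
    iput "$file" &&
        imeta add -d "$( basename "$file" )" upload_time "$now"
}
```

With a working iRODS session, iput_stamped world.txt would upload the file and attach the timestamp. Because the AVU is attached at the data object level rather than to one replica, it should not be lost when individual replicas are later trimmed and recreated.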

mcast commented 2 years ago

On Tue, Oct 25, 2022 at 09:03:11AM -0700, Terrell Russell wrote:

> Data objects do not have timestamps.... replicas have timestamps.
>
> Does that shift your mental model enough that a solution to your needs/assumptions presents itself?

That's pretty much where it had shifted to: the built-in metadata cannot answer my question. I was looking at "how much data used in the last year?" for purchase forecasts, and due to the irepl/itrim runs for hardware refresh I got two very different answers.
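
For illustration, a "data added since a given date" question might be phrased against the catalog roughly as below. This is a sketch under assumptions, not necessarily the query that was actually run: the helper names are hypothetical, and it assumes the cutoff must be given in the same zero-padded 11-digit form that iquest prints.

```shell
#! /bin/sh

# iquest prints timestamps as zero-padded 11-digit epoch seconds (see the
# listings in the report), so format the cutoff the same way.
irods_time() {
    printf '%011d' "$1"
}

# Hypothetical helper: build a GenQuery string summing DATA_SIZE over
# replicas created at or after the given epoch time.
size_since_query() {
    printf "select sum(DATA_SIZE) where DATA_CREATE_TIME >= '%s'\n" "$( irods_time "$1" )"
}

# 1635171119 is 365 days before the iput timestamp shown in the report.
size_since_query 1635171119
```

The generated string could be passed to iquest as its query argument. Since DATA_CREATE_TIME belongs to each replica, a query of this shape counts every replica and treats any replica recreated by irepl during a hardware refresh as new data, which matches the discrepancy described here.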

What remains is a name which I found misleading, plus a lack of documentation to avoid the confusion before finding out the hard way.

If it were called REPLICA_CREATE_TIME then I would expect it.

If it were called FILE_CREATE_TIME then I think the current behaviour would be clearly incorrect.

In the absence of other guidance the name DATA_CREATE_TIME lands somewhere between; then the existence of DATA_MODIFY_TIME suggests there is more complex behaviour available.

What is the difference between DATA_CREATE_TIME and DATA_MODIFY_TIME? I've never noticed them differ, but we don't do anything exciting with our pebibytes.

> You trimmed the original replica, so the original timestamp is now gone.

Yes, this was done to the earliest bunch of files - stored 5 years ago.

> You could always attach an AVU (at the data object level) with markings/timings of your choosing.

Fortunately we do that, indirectly, and it's how I got two different answers.

We have an AVU on each file which is a foreign key to an external (LIMS) database table. The PK is monotonic-ish, the way sequences are with caching, and in the table we record file creation time.

(( Re: the script attached to OP, we don't leave a replica in the "purple tree" after doing those operations. In the production process, resources for hardware refresh are renamed from (red xor green) to purple. This causes irepl onto the necessary colour to return to 1 red + 1 green; then the purple are sent to the retirement pasture / glue boiler. ))


trel commented 2 years ago

Agreed on all counts. The attrs token takes its name from the table in which it is found: r_data_main -> DATA_CREATE_TIME

If the data object table were split from the replica table, then yes, this would have been clearer.

So, for now, yes, your workaround/solution seems the best without an explicit timestamp AVU getting minted upon initial upload.
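
If a timestamp AVU were minted at upload, reading it back could look roughly like this. The sketch is only illustrative: the helper name upload_time_query and the attribute name upload_time are assumptions, and it merely builds the query string rather than contacting a zone.

```shell
#! /bin/sh

# Hypothetical helper: build a GenQuery string that reads back a timestamp
# AVU from a data object. The attribute name "upload_time" is an assumption.
# Arguments: collection name, data object name.
upload_time_query() {
    printf "select META_DATA_ATTR_VALUE where COLL_NAME = '%s' and DATA_NAME = '%s' and META_DATA_ATTR_NAME = 'upload_time'\n" "$1" "$2"
}

upload_time_query /cgp_dev/home/mca2 world.txt
```

The printed string can be handed to iquest as its query argument. The value returned would be whatever was stored at upload, regardless of which replicas currently exist.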

I'll leave this issue open as 'documentation' for us to think through describing in more detail the iquest attrs tokens.

mcast commented 2 years ago

Thanks @trel

alanking commented 6 months ago

I'm working on adding a GenQuery page here: https://github.com/irods/irods_docs/pull/251

We can add explanations of the GenQuery attributes as a next step.