irods / irods

Open Source Data Management Software
https://irods.org
BSD 3-Clause "New" or "Revised" License
440 stars 141 forks source link

add count-distinct() to genQuery #3671

Open trel opened 6 years ago

trel commented 6 years ago

The current genQuery logic (4.2.1) does not support the distinct keyword within the count() construct. This prevents our ability to get a total of the number of data_objects under management without counting all the replicas or specifying resource hierarchy information.

$ ils -L
/tempZone/home/rods:
  rods              0 demoResc          224 2017-09-26.11:13 & foo
        generic    /var/lib/irods/Vault/home/rods/foo
  rods              0 demoResc          224 2017-09-27.09:03 & foo2
        generic    /var/lib/irods/Vault/home/rods/foo2
  rods              1 AnotherResc          224 2017-09-27.09:03 & foo2
        generic    /tmp/irods/AnotherResc/home/rods/foo2
  rods              0 demoResc          166 2017-09-27.13:39 & foo3
        generic    /var/lib/irods/Vault/home/rods/foo3

$ iquest "select DATA_NAME where COLL_NAME like '/tempZone/home/rods%'"
DATA_NAME = foo
------------------------------------------------------------
DATA_NAME = foo2
------------------------------------------------------------
DATA_NAME = foo3
------------------------------------------------------------

$ iquest "select count(DATA_NAME) where COLL_NAME like '/tempZone/home/rods%'"
DATA_NAME = 4
------------------------------------------------------------

The 'distinct' keyword is automatically added to the first genQuery above, and the result is the three DATA_NAMEs returned.

Once the count() is added to the genQuery, the distinct is still applied, but outside the count(). I am interested in the result from this query (where distinct is within the count()), but it cannot currently be constructed via GenQuery:

select count(distinct d.data_name) from R_DATA_MAIN d

Suggested enhancement:

Create count-distinct keyword within genQuery to handle this type of request.

$ iquest "select count-distinct(data_name) where COLL_NAME like '/tempZone/home/rods%'"
DATA_NAME = 3
------------------------------------------------------------

This may also mean that it makes sense to add -distinct versions for some other aggregate keywords within genQuery:

It does not make any sense to add:

luijs commented 10 months ago

I ran into this today, so a +1 from me;). It also is a bit confusing because the distinct is included implicitly in the query(I found out because there was a no-distinct option), so at first you don't know why you get a different count then the number of outputs without doing a count, so it might be good to make it explicit instead.

korydraughn commented 10 months ago

We will address this issue in the new GenQuery2 parser.