indigo-dc / onedata

Indigo mirror of http://github.com/onedata/onedata
Apache License 2.0
1 stars 1 forks source link

Elastic search for data aggregations and queries #12

Open evasciacca opened 7 years ago

evasciacca commented 7 years ago

Dear developers,

We would like to understand how to perform elastic search for data aggregations and queries. We have done some tests on the previous OneData release 3.0.0 beta 7 using the direct connection to CouchBase on the bucket 'xattr'. In release 3.0.0 rc9 we did not find where and how metadata are stored on the CouchBase. It would be very useful for us to include n1ql in the containers so that we could access the n1ql GUI on port 8094 and test the queries. Any news on this topic?

Thank you,

Eva.

groundnuty commented 7 years ago

Dear Eva, Your request for including the n1ql refers to oneprovider, onezone docker images or both?

Michal

groundnuty commented 7 years ago

Dear Eva, Regarding the metadata in rc9, I was informed they were moved to: custom_metadata bucket, where id's are the same as for files and have 2 values (space_id,value), where value is of a form {key:value}, where key is a name of xattr and the value is it's value. I hope that helps.

Michal

On Fri, Dec 2, 2016 at 7:53 PM, Michal Orzechowski < orzechowski.michal@gmail.com> wrote:

Dear Eva, Your request for including the n1ql refers to oneprovider, onezone docker images or both?

Michal

evasciacca commented 7 years ago

Dear Michal,

thank you very much for your replies. Regarding n1ql it would be needed in oneprovider docker image (where metadata are stored) but if you think it could be useful for elastic search on metadata also on onezone image.

I will now investigate the new bucket custom_metadata.

Furthermore, I was also experimenting the advanced metadata queries (https://onedata.org/docs/doc/using_onedata/metadata.html#advanced-metadata-queries) using REST API. From which onedata release does it work?

Regards,

Eva.

groundnuty commented 7 years ago

Dear Eva, Onedata.org is atm. at 3.0.0-rc9. The newest version is always deployed: beta.oneda.org - atm. rc11 however the documentation is exactly the same in rc9 and rc11 for that particular feature: https://beta.onedata.org/docs/doc/using_onedata/metadata.html#advanced-metadata-queries

So it works for rc9 and on.

I will pass the requirement for n1ql in our upcoming Friday planing meeting.

Regards, Michal

On Mon, Dec 5, 2016 at 10:33 AM, evasciacca notifications@github.com wrote:

Dear Michal,

thank you very much for your replies. Regarding n1ql it would be needed in oneprovider docker image (where metadata are stored) but if you think it could be useful for elastic search on metadata also on onezone image.

I will now investigate the new bucket custom_metadata.

Furthermore, I was also experimenting the advanced metadata queries ( https://onedata.org/docs/doc/using_onedata/metadata.html# advanced-metadata-queries) using REST API. From which onedata release does it work?

Regards,

Eva.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/indigo-dc/onedata/issues/12#issuecomment-264806372, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOuppJB-GATMcLdvbCOtfG-wc8Zn_3hks5rE9pMgaJpZM4LABIZ .

evasciacca commented 7 years ago

Dear Michal,

are there any news regarding the inclusion of the n1ql for next releases?

We have updated OneData to release rc11 and we are testing the queries using the indexing functions over the metadata and we have successfully done some queries using them.

Now, we would like to create more complex queries including more than one index combination, e.g. to perform this kind of sql query:

SELECT FILEPATH/FILEID from FILE-BUCKET-ONEDATA where ORIGIN_ID='41' and PROGRAM_ID='001' and OBSERV_ID='00001' and TELESCOPE_ID='000' and DATE BETWEEN '%2014-08-20%' and '%2014-08-26%';

Do you know if it is possible and how to do this?

Thank you in advance,

Eva.

xorver commented 7 years ago

We've had a discussion on out hipchat support channel and we've come up with examples of such queries, which were successfully used by Eva. To summarize the topic, here are the examples provided:

environment

SPACE_ID=4dfVO2NTmZJJzrzIYOzUJkXFA8ndZRveThPgSscPbtA
TOKEN=MDAxNWxvY2F00aW9uIG9uZXpvbmUKMDAzYmlkZW500aWZpZXIgYXJSbF9HRWZ6WWpzNlFsR001PTG1QVl9Fb009yN1kzSVlaZXVONUtWRjhrNAowMDFhY2lkIHRpbWUgPCAxNTE4MDE3MjI4CjAwMmZzaWduYXR1cmUgNLd02VRyADxrt12Sk8d01oSYMzFIcsG2IuxjdzNaLky5AK

create file

curl -k -v --tlsv1.2 -H "X-Auth-Token: $TOKEN" -H "X-CDMI-Specification-Version: 1.1.1" -H "Accept: application/cdmi-object" -H "Content-Type: application/cdmi-object" \
-d '{"metadata": {"ORIGIN_ID": "41", "PROGRAM_ID": "001", "OBSERV_ID": "00001", "RUNS_ID": "000004", "MODES_ID": "R" , "SEQUENCE_NUM": "000", "PACKET_TYPE": "1002", "PROP_ID": "0000000000000001", "DATATYPE": "0000", "TSTART": "430580855", "TSTOP": "430580965" }}' \
-X PUT "https://localhost:8443/cdmi/test/file?metadata"

complexindex.js

function(meta) {
        if(meta['ORIGIN_ID'] && meta['PROGRAM_ID']) {
                return [meta['ORIGIN_ID'], meta['PROGRAM_ID']];
        }
        return null;
}

create index

curl -v -k --tlsv1.2 -X POST -H "X-Auth-Token: $TOKEN" -H 'Content-type: application/javascript' -d @complexindex.js "https://localhost:8443/api/v3/oneprovider/index?space_id=$SPACE_ID&name=index1"

query index

INDEX_ID=t2zLyYaR0HPHO8d6wICNYiRDqzc8lGyVVcGUlbepBWo
curl -v -k --tlsv1.2 -Ss -H "X-Auth-Token: $TOKEN" -H 'Content-type: application/javascript'  -X GET "https://localhost:8443/api/v3/oneprovider/query-index/$INDEX_ID?key=\[\"41\",\"001\"\]&stale=false"

Regards, Tomasz

groundnuty commented 7 years ago

@xorver thank you for providing this short guide. @evasciacca is information Tomasz supplied sufficient?

evasciacca commented 7 years ago

Dear Tomasz and Michal,

thank you very much, your support was very helpful. I was able to perform some test queries. Now we are waiting for further improvements such as the possibility to delete unused indexes that is foreseen for next OneData releases (maybe rc13).

Cheers,

Eva.

evasciacca commented 7 years ago

Dear Tomasz,

we would need to include in the query some information on the creation time of the dataset. Therefore I added to the complexindex.js the meta['ctime'] attribute but it is not working. Is this operation maybe working only on extended attributes and not on internal ones?

Thank you,

Eva.

xorver commented 7 years ago

Dear Eva,

Those indices work only on user created attributes, not the internal ones (as you suspected).

Regards, Tomasz Lichoń

evasciacca commented 7 years ago

Dear Tomasz,

thank you for your prompt replay. Would it be possible to add this possibility as enhancement for next OneData releases. Maybe the fastest way would be to add the internal attributes related to time creation/modification as extended ones (not user modifiable) on data ingestion / modification. Or maybe modify rest API to work with internal attributes...

Please let us know if you will make progresses in this direction!

Kind regards,

Eva.

xorver commented 7 years ago

Dear Eva,

Times are stored in a separate document, as they're changed frequently. We may possibly consider merging them with metadata document, but we've decided not to because the change of access time would provoke a lot of index updates. Moreover, there are other internal attributes that we cannot possibly add (such as size) because they are generated dynamically. Adding only a part of them would be confusing. There are also some compatibility issues that may be hard to overcome during an upgrade.

If you're only interested in creation time I would recommend adding it manually after creation. By the way ctime corresponds to attribute change time, not creation, so it may be newer if the file got modified after creation. I'll discuss with my colleagues whether to add automatically to each file xattr called creation_time, maybe it will not be a big deal.

Regards, Tomasz Lichoń

evasciacca commented 7 years ago

Thank you very much.

evasciacca commented 7 years ago

Dear Tomasz,

also we wold like to understand if it would be possible to use operators such as "not equal", "like" or "order by" and if these are planned to be implemented in next releases.

Thank you in advance,

Eva.

xorver commented 7 years ago

Dear Eva,

We're limited to the API provided by couchbase (http://docs.couchbase.com/admin/admin/Views/views-querying.html). We create an index directly in couchbase, wrapping your function with ours. Our 'map' function uses your function to get the key (based on metadata) and emits (your_key, file_id). Then every query is simply passed directly into couchbase (you define keys, flags etc, and obtain file_ids in return).

Couchbase queries do not support operations: "not equal", "like", "order by". But you may create an index with a function that emits only files that satisfy your requirements, e. g.

For emitting only files authored by 'Eva*' that are of version different than '3.0.0':

function(meta) {
        if(meta['VERSION'] && meta['AUTHOR'])
                if(meta['VERSION'] != '3.0.0' && meta['AUTHOR'].match(/Eva.*/)
                        return 1;
        return null;
}

Then query it with key=1 or without key at all (it will return all entries). However, for each query pattern you'd need to create a new index, and it does not cover "order by".

Regards, Tomasz Lichoń

groundnuty commented 7 years ago

@evasciacca can we consider this issue as closed or there are still things to be resolved/answered here?

evasciacca commented 7 years ago

Dear @groundnuty,

it is still not yet very clear how to use the new custom_metadata bucket from n1ql. I am not even able to see it from the couchbase console (see attached screenshot). Could you please provide more information on how to query metadata from the couchbase db to retrive the datasets using more complex queries directly on the metadata bucket?

Thank you in advance,

Eva. cb_console

xorver commented 7 years ago

Dear Eva,

As discussed in Catania, for now, we will try to add FileId to each custom_metadata document, so you may query Couchbase directly. It should be available in rc15, I'll write info here when it's done.

Tomasz

evasciacca commented 7 years ago

Dear @xorver , thank you very much, also could you please provide us a sample CouchBase n1ql query to understand how to reach the correct custom_metadata document? Cheers, Eva.

xorver commented 7 years ago

I have an example for you, I've entered the machine with running couchbase, then executed:

root@couchbase0-p1:~# /opt/couchbase/bin/cbq
Couchbase query shell connected to http://localhost:8093/ . Type Ctrl-D to exit.
cbq> CREATE PRIMARY INDEX sync ON sync;
{
    "requestID": "126f23f8-c507-425f-8932-d672d7ad8209",
    "signature": null,
    "results": [
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "3.763397768s",
        "executionTime": "3.763321304s",
        "resultCount": 0,
        "resultSize": 0
    }
}
cbq> SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata";
{
    "requestID": "97a0abc4-ecad-43b4-8e69-40173e75f926",
    "signature": {
        "*": "*"
    },
    "results": [
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": "false",
            "_sync": {
                "history": {
                    "channels": [
                        null
                    ],
                    "parents": [
                        -1
                    ],
                    "revs": [
                        "1-64a3bd48eccc930028116168fb118ae0"
                    ]
                },
                "recent_sequences": [
                    142
                ],
                "rev": "1-64a3bd48eccc930028116168fb118ae0",
                "sequence": 142,
                "time_saved": "2017-05-19T10:15:36.656732653Z"
            },
            "space_id": "s1",
            "value": {
                "onedata_json": {
                    "key": []
                },
                "onedata_rdf": "",
                "xattr_name": "xattr_value"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "28.257026ms",
        "executionTime": "28.126285ms",
        "resultCount": 2,
        "resultSize": 1933
    }
}

Metadata is in "value" field of fetched json. In rc15 there will be a file_id in that document also.

I guess you may do it using libs provided by couchbase or REST API as well.

evasciacca commented 7 years ago

Dear @xorver , thank you very much. I was able to access the custom_metadata. Now I would like to understand how to query a metadata included in the "value" field to get all documents having a specific metadata value. Here a sample output of my results:

cbq> SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata";
{
    "requestID": "dfda8e15-b136-43f6-b6cf-6881a429f3f7",
    "signature": {
        "*": "*"
    },
    "results": [
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": true,
            "_sync": {
                "flags": 1,
                "history": {
                    "channels": [
                        null,
                        null
                    ],
                    "deleted": [
                        1
                    ],
                    "parents": [
                        -1,
                        0
                    ],
                    "revs": [
                        "1-be3de124785f5d522b751fc21c95c53e",
                        "2-7a9e86db4a997718f9fc75e22db130a8"
                    ]
                },
                "recent_sequences": [
                    100,
                    141
                ],
                "rev": "2-7a9e86db4a997718f9fc75e22db130a8",
                "sequence": 141,
                "time_saved": "2017-04-13T10:03:07.635264976Z"
            },
            "space_id": "Y37xOXwev7zfOP7Gie0zs-ZNQLYHngluF5Xx_Ar3kWI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000005",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        },
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": "false",
            "_sync": {
                "history": {
                    "channels": [
                        null,
                        null
                    ],
                    "parents": [
                        -1,
                        0
                    ],
                    "revs": [
                        "1-c20a06ee14f149db59e6c7b7973f3f5a",
                        "2-116eb453742dc83ee8731356b0163934"
                    ]
                },
                "recent_sequences": [
                    221,
                    227
                ],
                "rev": "2-116eb453742dc83ee8731356b0163934",
                "sequence": 227,
                "time_saved": "2017-04-14T09:19:30.18638156Z"
            },
            "space_id": "cZLEIUmL5Wi6kw53qpUVXWGA5a_2P8nY4tlvE8U3IgI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000004",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        },
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": "false",
            "_sync": {
                "history": {
                    "channels": [
                        null
                    ],
                    "parents": [
                        -1
                    ],
                    "revs": [
                        "1-c20a06ee14f149db59e6c7b7973f3f5a"
                    ]
                },
                "recent_sequences": [
                    226
                ],
                "rev": "1-c20a06ee14f149db59e6c7b7973f3f5a",
                "sequence": 226,
                "time_saved": "2017-04-14T09:19:02.874373927Z"
            },
            "space_id": "cZLEIUmL5Wi6kw53qpUVXWGA5a_2P8nY4tlvE8U3IgI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000005",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        },
        {
            "\u003crecord_type\u003e": "custom_metadata",
            "\u003crecord_version\u003e": 1,
            "_deleted": true,
            "_sync": {
                "flags": 1,
                "history": {
                    "channels": [
                        null,
                        null
                    ],
                    "deleted": [
                        0
                    ],
                    "parents": [
                        1,
                        -1
                    ],
                    "revs": [
                        "2-42f8698fad6d55be0a100aa39155ce0e",
                        "1-db14a5de948411595295684b189a8931"
                    ]
                },
                "recent_sequences": [
                    66,
                    134
                ],
                "rev": "2-42f8698fad6d55be0a100aa39155ce0e",
                "sequence": 134,
                "time_saved": "2017-04-13T10:02:58.147177312Z"
            },
            "space_id": "Y37xOXwev7zfOP7Gie0zs-ZNQLYHngluF5Xx_Ar3kWI",
            "value": {
                "DATATYPE": "0000",
                "MODES_ID": "R",
                "OBSERV_ID": "00001",
                "ORIGIN_ID": "41",
                "PACKET_TYPE": "1002",
                "PROGRAM_ID": "001",
                "PROP_ID": "0000000000000001",
                "RUNS_ID": "000004",
                "SEQUENCE_NUM": "000",
                "TSTART": "430580855",
                "TSTOP": "430580965"
            }
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "55.515451ms",
        "executionTime": "55.481407ms",
        "resultCount": 4,
        "resultSize": 5756
    }
}
xorver commented 7 years ago

You need to extend the "where" clause just as in typical SQL, e. g.

SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata" AND sync.value.DATATYPE = "0000";

The syntax: http://www.dofactory.com/sql/where-and-or-not

evasciacca commented 7 years ago

Yes I have tried this before but with no success:

cbq> SELECT sync.* FROM sync where sync.`<record_type>` = "custom_metadata" AND sync.value.ORIGIN_ID = "41";
{
    "requestID": "2e6be4ab-a994-4678-9e04-27b946a5f2c5",
    "errors": [
        {
            "code": 3000,
            "msg": "syntax error - at value"
        }
    ],
    "status": "fatal",
    "metrics": {
        "elapsedTime": "2.41933ms",
        "executionTime": "2.327075ms",
        "resultCount": 0,
        "resultSize": 0,
        "errorCount": 1
    }
}
xorver commented 7 years ago

Try it like this:

SELECT sync.* FROM sync WHERE sync.`<record_type>` = "custom_metadata" AND sync.`value`.`name` = "something";

It worked with my setup.

evasciacca commented 7 years ago

Great, thank you very much! It works now... so from next release r15 we will automatically find a file_id from which we can retrieve the data file, is it correct?

Thank you,

Eva.

xorver commented 7 years ago

Yes, it's correct.