Add filters in DID search

ftorradeflot commented 6 months ago

The DID search in the jupyterlab extension only takes into account three parameters: name, scope and type.

It would be useful to be able to filter the DIDs by user-defined metadata. This could be done by adding an additional "filters" parameter in the DID search and forwarding it to the "filters" argument in the /dids/{scope}/dids/search REST API endpoint that relies on rucio.core.did.list_dids method.

The "filters" argument is not listed in the documentation of the REST API method, but it seems to be there when looking into the underlying code

An effort to implement this feature has been started in this fork

GeorgySk commented 6 months ago

Hi, all!

The author of the fork here. You can find a prototype of the interface located here: link.

As I've mentioned previously on the Mattermost, I've been trying to integrate this interface in the extension but I've been having some issues with it. It seems that the functionality that should fire after clicking the newly introduced button is simply not there.

After more digging, I discovered to my surprise that now I cannot even remove the button... It's just stuck there and even removing the code associated with it doesn't do anything. I have checked if the extension gets built correctly, I've also checked the build directory and the timestamps of the newly generated files, and I've checked the source code in the rebuilt Docker container to see if it gets updated correctly. Nothing so far points me to the source of the issue. I thought that it could be a browser caching issue but opening the JupyterLab in an incognito mode shows that the button is still there. Baffling. I'm sure I'm making a silly mistake somewhere but I cannot comprehend what it is at the moment.

Any insights on this issue are welcome.

P.S.: As I was advised, I've rebased my fork onto the master branch of this fork made by @garciagenrique.

GeorgySk commented 6 months ago

Ok, I've figured it out. It was a blunder from my side. A typo in the rucio-jupyterlab Docker image name (at some point I've started tagging it as rucio_jupyterlab for some reason). I can advance now to adding the filtering functionality itself, and I'll also have to do some polishing of the interface as the styles are all over the place.

GeorgySk commented 4 months ago

The functionality for filtering the DIDs based on the specified filters of user metadata is now added. Feel free to test it. There is still some work to be done like improving style of the interface and some minor code refactoring, though.

There are a couple of things which are important to mention. If you'll look at the code, you will notice that I've added a function parse_did_filter_from_string_fe in rucio_jupyterlab/rucio/rucio.py, https://github.com/GeorgySk/jupyterlab-extension/blob/a6fa76615592131bc09f5c91295dadbd7e7b6601/rucio_jupyterlab/rucio/rucio.py#L20. This function is actually copied from the Rucio code here: https://github.com/rucio/rucio/blob/d7c7645b10dd6b904108bd94bd48f6ce14a20d90/lib/rucio/common/utils.py#L1455. I needed that function for conversion of strings from, e.g., key1=value1;key2=value2,key3=value3 into lists of dicts like [{'key1': 'value1'}, {'key2': 'value2', 'key3': 'value3'}]. This code duplication could be avoided by either parsing the state of the interface into a list of dicts directly, or by fixing, what I believe, an erroneous logic in the Rucio repository at https://github.com/rucio/rucio/blob/d7c7645b10dd6b904108bd94bd48f6ce14a20d90/lib/rucio/core/did_meta_plugins/__init__.py#L198 where it is assumed that the filters are a list of dicts before they have a chance to be converted with the aforementioned function.

Another issue is that the JSON plugin doesn't return some attributes like a DID type, and file size, which are required if we want to display the DIDs correctly in the panel with the search results. As a workaround, I've added a method get_metadata to retrieve the regular metadata here: https://github.com/GeorgySk/jupyterlab-extension/blob/master/rucio_jupyterlab/rucio/rucio.py#L196 and I call it for every DID that lacks the required information here: https://github.com/GeorgySk/jupyterlab-extension/blob/a6fa76615592131bc09f5c91295dadbd7e7b6601/rucio_jupyterlab/handlers/did_search.py#L39-L44. I'm not entirely sure if this is a correct approach.

Also, something that I haven't thought about so far is the tests. I'll have to take a look at that.

garciagenrique commented 4 months ago

Hello @GeorgySk Very good news ! I will try to give a try these days, but don't hesitate to create a PR once everything is ready to check that the test are passing correctly !

ftorradeflot commented 3 months ago

I downloaded your version and gave it a try. It seems to work properly. The only thing I found is that errors are not properly handled. For example:

I add attr1=1 to a file with rucio set-metadata --did test:file1 --key attr1 --value 1
try to get it from the client rucio -vv list-dids test:* --filter "type=ALL,attr1=1" I get this error:

2024-08-20 10:39:17,470 ERROR Provided metadata is considered invalid. Details: Filter keys used do not all belong to the same metadata plugin.

when I try from the extension I get a "No results found" message.

I haven't yet looked into the tests or the documentation.

Thank you very much. Great effort!

ftorradeflot commented 3 months ago

The problem with Rucio API error handling is probably not related to this feature, but something general throughout the exension that should be improved.

ftorradeflot commented 3 months ago

I created a PR with some changes to be merged in @GeorgySk 's branch, mainly test fixes: https://github.com/GeorgySk/jupyterlab-extension/pull/1

I only fixed the already existing tests and included a minimal test on a React component. It would be good to include some tests additional tests specifically on the usage of metadata filters on the python side.

From the implementation pov I wonder if it would be possible to have another React component (e.g. MetadataFilterList), containing the list of metadata filters and the methods to manage them, in a separate module, instead of having everything in ExploreTab.tsx. This would also ease testing the addition, removal of new filters.

I don't think these improvements should be blocking in case there's an urgent need to have this feature in production. But, if there's no hurry, I'd rather implement them.

GeorgySk commented 2 months ago

Thanks for the PR, @ftorradeflot! I will get back to you with a feedback on it later.

Me and Andres Tanasijczuk also discussed the new code on Slack, and he had some comments on it as well. I will post them here to have everything in one place.

Here json.dumps is added but it seems to be not necessary as the Rucio client code doesn't have it as can be seen here and here.
The name parameter from here probably should be added to the filters instead, like filters, _ = parse_did_filter_from_string_fe(filters, name). This is because right now the filters do not work together with specifying a name of a DID. The filters take precedence.
As it was mentioned before, get_metadata gets called for each DID returned by the search which is not a good thing. The get_metadata method is adding did_type , bytes and length, which are the three properties that the json_meta plugin is not returning compared to the did_column_meta plugin. We might want to ignore the length property as it is probably not used by the JupyterLab extension. We could probably also define an empty string as a default value for bytes?.. And about the did_type , we could propose to modify the json_meta plugin by adding a line models.DIDMeta.did_type here so that it returns also did_type, since it seems to be available in the DIDMeta class (see here). We could then check in the extension if there is a returned type. If there is not, we call get_metadata only if the search type is datasets and containers or everything; if the search type is files, datasets or containers we'd assume the returned type is just the search type.

GeorgySk commented 2 months ago

Regarding the first point about the json.dumps. Indeed, it can be removed and the result will be the same. This is because, urlencode will convert the filters to a string under a hood. And while that string will be not a valid JSON, it won't matter, since it will be converted back to a list of dicts not by json.loads as I incorrectly assumed but by ast.literal_eval.

One thing to note here is that the extension wraps all the filter keys and values in quotes regardless of if they are numeric or not. Numbers will then be casted to numeric types by calls to ast.literal_eval here, and the strings... will cause those calls to throw ValueErrors as the strings themselves do not contain quotes. The absence of inner quotes can break things if, for example, a string is not a valid Python syntax. For example, a filter key/value with a space in it, e.g. a b will cause ast.literal_eval to throw SyntaxError which won't be caught. To fix this, we should probably have the extension adding a second set of quotes to non-numeric strings.

GeorgySk commented 2 months ago

I've just pushed a fix for that issue with string values that are not valid Python syntax. Now all non-numeric keys/values get additional quotes that help preventing unwarranted SyntaxErrors in the Rucio code. Importantly, I've made an assumption that the keys cannot be numeric. IMHO, that's a safe assumption, but I wanted to ask what you think about it.

GeorgySk commented 2 months ago

@ftorradeflot, I've just pushed a couple of more changes:

The search now works correctly when both filters and DID are specified.
Styles have also been fixed, and now switching between light and dark themes works correctly (previously the colors didn't change).

I've also looked into displaying errors for the case when there is a mismatch between types of filter values (e.g. when a file has a metadata value of a string type, but in the search field we specify a numeric value). But it seems like some changes should be introduced to the Rucio's code first. From what I understand, it doesn't handle response 500 correctly here. Instead, I'm getting dids like this:

[{'ExceptionClass': 'InvalidMetadata', 'ExceptionMessage': 'Database query failed: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type double precision: "b"\n\n[SQL: SELECT dev.did_meta.name AS dev_did_meta_name, dev.did_meta.scope AS dev_did_meta_scope \nFROM dev.did_meta \nWHERE CAST((dev.did_meta.meta ->> %(meta_1)s) AS FLOAT) = %(param_1)s AND dev.did_meta.scope = %(scope_1)s]\n[parameters: {\'meta_1\': \'a\', \'param_1\': 3, \'scope_1\': \'test\'}]\n(Background on this error at: https://sqlalche.me/e/20/9h9h). This can be raised when the datatype of a key is inconsistent between dids.'}]

If this is fixed, then I'd only have to add a couple of lines here to display an error instead of just "No results found".

In principle, I've finished with the code. I'm not sure what else is left to do. I mentioned that the JSON plugin doesn't return some of the attributes, and that I had to add calls to Rucio to retrieve regular metadata for each DID. But, in the end, I don't think I can do anything on the side of the extension, as, for example, I cannot simply ignore the size of the file since it is actually displayed in the panel with the results.

Should I create a pull request?

ftorradeflot commented 1 day ago

I've just pushed a fix for that issue with string values that are not valid Python syntax. Now all non-numeric keys/values get additional quotes that help preventing unwarranted SyntaxErrors in the Rucio code. Importantly, I've made an assumption that the keys cannot be numeric. IMHO, that's a safe assumption, but I wanted to ask what you think about it.

I second the assumption

ftorradeflot commented 1 day ago

@ftorradeflot, I've just pushed a couple of more changes:

The search now works correctly when both filters and DID are specified.

Styles have also been fixed, and now switching between light and dark themes works correctly (previously the colors didn't change).

I've also looked into displaying errors for the case when there is a mismatch between types of filter values (e.g. when a file has a metadata value of a string type, but in the search field we specify a numeric value). But it seems like some changes should be introduced to the Rucio's code first. From what I understand, it doesn't handle response 500 correctly here. Instead, I'm getting dids like this:
[{'ExceptionClass': 'InvalidMetadata', 'ExceptionMessage': 'Database query failed: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type double precision: "b"\n\n[SQL: SELECT dev.did_meta.name AS dev_did_meta_name, dev.did_meta.scope AS dev_did_meta_scope \nFROM dev.did_meta \nWHERE CAST((dev.did_meta.meta ->> %(meta_1)s) AS FLOAT) = %(param_1)s AND dev.did_meta.scope = %(scope_1)s]\n[parameters: {\'meta_1\': \'a\', \'param_1\': 3, \'scope_1\': \'test\'}]\n(Background on this error at: https://sqlalche.me/e/20/9h9h). This can be raised when the datatype of a key is inconsistent between dids.'}]
If this is fixed, then I'd only have to add a couple of lines here to display an error instead of just "No results found".

In principle, I've finished with the code. I'm not sure what else is left to do. I mentioned that the JSON plugin doesn't return some of the attributes, and that I had to add calls to Rucio to retrieve regular metadata for each DID. But, in the end, I don't think I can do anything on the side of the extension, as, for example, I cannot simply ignore the size of the file since it is actually displayed in the panel with the results.

Should I create a pull request?

Yes, @GeorgySk I think you can create a pull request

rucio / jupyterlab-extension

Add filters in DID search #36