Open ftorradeflot opened 6 months ago
Hi, all!
The author of the fork here. You can find a prototype of the interface located here: link.
As I've mentioned previously on the Mattermost, I've been trying to integrate this interface in the extension but I've been having some issues with it. It seems that the functionality that should fire after clicking the newly introduced button is simply not there.
After more digging, I discovered to my surprise that now I cannot even remove the button... It's just stuck there and even removing the code associated with it doesn't do anything. I have checked if the extension gets built correctly, I've also checked the build
directory and the timestamps of the newly generated files, and I've checked the source code in the rebuilt Docker container to see if it gets updated correctly. Nothing so far points me to the source of the issue. I thought that it could be a browser caching issue but opening the JupyterLab in an incognito mode shows that the button is still there. Baffling. I'm sure I'm making a silly mistake somewhere but I cannot comprehend what it is at the moment.
Any insights on this issue are welcome.
P.S.: As I was advised, I've rebased my fork onto the master branch of this fork made by @garciagenrique.
Ok, I've figured it out. It was a blunder from my side. A typo in the rucio-jupyterlab
Docker image name (at some point I've started tagging it as rucio_jupyterlab
for some reason). I can advance now to adding the filtering functionality itself, and I'll also have to do some polishing of the interface as the styles are all over the place.
The functionality for filtering the DIDs based on the specified filters of user metadata is now added. Feel free to test it. There is still some work to be done like improving style of the interface and some minor code refactoring, though.
There are a couple of things which are important to mention. If you'll look at the code, you will notice that I've added a function parse_did_filter_from_string_fe
in rucio_jupyterlab/rucio/rucio.py
, https://github.com/GeorgySk/jupyterlab-extension/blob/a6fa76615592131bc09f5c91295dadbd7e7b6601/rucio_jupyterlab/rucio/rucio.py#L20. This function is actually copied from the Rucio code here: https://github.com/rucio/rucio/blob/d7c7645b10dd6b904108bd94bd48f6ce14a20d90/lib/rucio/common/utils.py#L1455. I needed that function for conversion of strings from, e.g., key1=value1;key2=value2,key3=value3
into lists of dicts like [{'key1': 'value1'}, {'key2': 'value2', 'key3': 'value3'}]
. This code duplication could be avoided by either parsing the state of the interface into a list of dicts directly, or by fixing, what I believe, an erroneous logic in the Rucio repository at https://github.com/rucio/rucio/blob/d7c7645b10dd6b904108bd94bd48f6ce14a20d90/lib/rucio/core/did_meta_plugins/__init__.py#L198 where it is assumed that the filters are a list of dicts before they have a chance to be converted with the aforementioned function.
Another issue is that the JSON plugin doesn't return some attributes like a DID type, and file size, which are required if we want to display the DIDs correctly in the panel with the search results. As a workaround, I've added a method get_metadata
to retrieve the regular metadata here: https://github.com/GeorgySk/jupyterlab-extension/blob/master/rucio_jupyterlab/rucio/rucio.py#L196 and I call it for every DID that lacks the required information here: https://github.com/GeorgySk/jupyterlab-extension/blob/a6fa76615592131bc09f5c91295dadbd7e7b6601/rucio_jupyterlab/handlers/did_search.py#L39-L44. I'm not entirely sure if this is a correct approach.
Also, something that I haven't thought about so far is the tests. I'll have to take a look at that.
Hello @GeorgySk Very good news ! I will try to give a try these days, but don't hesitate to create a PR once everything is ready to check that the test are passing correctly !
I downloaded your version and gave it a try. It seems to work properly. The only thing I found is that errors are not properly handled. For example:
rucio set-metadata --did test:file1 --key attr1 --value 1
rucio -vv list-dids test:* --filter "type=ALL,attr1=1"
I get this error:2024-08-20 10:39:17,470 ERROR Provided metadata is considered invalid. Details: Filter keys used do not all belong to the same metadata plugin.
I haven't yet looked into the tests or the documentation.
Thank you very much. Great effort!
The problem with Rucio API error handling is probably not related to this feature, but something general throughout the exension that should be improved.
I created a PR with some changes to be merged in @GeorgySk 's branch, mainly test fixes: https://github.com/GeorgySk/jupyterlab-extension/pull/1
I only fixed the already existing tests and included a minimal test on a React component. It would be good to include some tests additional tests specifically on the usage of metadata filters on the python side.
From the implementation pov I wonder if it would be possible to have another React component (e.g. MetadataFilterList), containing the list of metadata filters and the methods to manage them, in a separate module, instead of having everything in ExploreTab.tsx. This would also ease testing the addition, removal of new filters.
I don't think these improvements should be blocking in case there's an urgent need to have this feature in production. But, if there's no hurry, I'd rather implement them.
Thanks for the PR, @ftorradeflot! I will get back to you with a feedback on it later.
Me and Andres Tanasijczuk also discussed the new code on Slack, and he had some comments on it as well. I will post them here to have everything in one place.
json.dumps
is added but it seems to be not necessary as the Rucio client code doesn't have it as can be seen here and here.name
parameter from here probably should be added to the filters instead, like filters, _ = parse_did_filter_from_string_fe(filters, name)
. This is because right now the filters do not work together with specifying a name of a DID. The filters take precedence.get_metadata
gets called for each DID returned by the search which is not a good thing. The get_metadata
method is adding did_type
, bytes
and length
, which are the three properties that the json_meta
plugin is not returning compared to the did_column_meta
plugin. We might want to ignore the length
property as it is probably not used by the JupyterLab extension. We could probably also define an empty string as a default value for bytes
?.. And about the did_type
, we could propose to modify the json_meta
plugin by adding a line models.DIDMeta.did_type
here so that it returns also did_type
, since it seems to be available in the DIDMeta
class (see here). We could then check in the extension if there is a returned type. If there is not, we call get_metadata
only if the search type is datasets and containers
or everything
; if the search type is files
, datasets
or containers
we'd assume the returned type is just the search type.Regarding the first point about the json.dumps
. Indeed, it can be removed and the result will be the same. This is because, urlencode
will convert the filters
to a string under a hood. And while that string will be not a valid JSON, it won't matter, since it will be converted back to a list of dicts not by json.loads
as I incorrectly assumed but by ast.literal_eval
.
One thing to note here is that the extension wraps all the filter keys and values in quotes regardless of if they are numeric or not. Numbers will then be casted to numeric types by calls to ast.literal_eval
here, and the strings... will cause those calls to throw ValueError
s as the strings themselves do not contain quotes. The absence of inner quotes can break things if, for example, a string is not a valid Python syntax. For example, a filter key/value with a space in it, e.g. a b
will cause ast.literal_eval
to throw SyntaxError
which won't be caught. To fix this, we should probably have the extension adding a second set of quotes to non-numeric strings.
I've just pushed a fix for that issue with string values that are not valid Python syntax. Now all non-numeric keys/values get additional quotes that help preventing unwarranted SyntaxError
s in the Rucio code. Importantly, I've made an assumption that the keys cannot be numeric. IMHO, that's a safe assumption, but I wanted to ask what you think about it.
@ftorradeflot, I've just pushed a couple of more changes:
I've also looked into displaying errors for the case when there is a mismatch between types of filter values (e.g. when a file has a metadata value of a string type, but in the search field we specify a numeric value). But it seems like some changes should be introduced to the Rucio's code first. From what I understand, it doesn't handle response 500 correctly here. Instead, I'm getting dids
like this:
[{'ExceptionClass': 'InvalidMetadata', 'ExceptionMessage': 'Database query failed: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type double precision: "b"\n\n[SQL: SELECT dev.did_meta.name AS dev_did_meta_name, dev.did_meta.scope AS dev_did_meta_scope \nFROM dev.did_meta \nWHERE CAST((dev.did_meta.meta ->> %(meta_1)s) AS FLOAT) = %(param_1)s AND dev.did_meta.scope = %(scope_1)s]\n[parameters: {\'meta_1\': \'a\', \'param_1\': 3, \'scope_1\': \'test\'}]\n(Background on this error at: https://sqlalche.me/e/20/9h9h). This can be raised when the datatype of a key is inconsistent between dids.'}]
If this is fixed, then I'd only have to add a couple of lines here to display an error instead of just "No results found".
In principle, I've finished with the code. I'm not sure what else is left to do. I mentioned that the JSON plugin doesn't return some of the attributes, and that I had to add calls to Rucio to retrieve regular metadata for each DID. But, in the end, I don't think I can do anything on the side of the extension, as, for example, I cannot simply ignore the size of the file since it is actually displayed in the panel with the results.
Should I create a pull request?
I've just pushed a fix for that issue with string values that are not valid Python syntax. Now all non-numeric keys/values get additional quotes that help preventing unwarranted
SyntaxError
s in the Rucio code. Importantly, I've made an assumption that the keys cannot be numeric. IMHO, that's a safe assumption, but I wanted to ask what you think about it.
I second the assumption
@ftorradeflot, I've just pushed a couple of more changes:
- The search now works correctly when both filters and DID are specified.
- Styles have also been fixed, and now switching between light and dark themes works correctly (previously the colors didn't change).
I've also looked into displaying errors for the case when there is a mismatch between types of filter values (e.g. when a file has a metadata value of a string type, but in the search field we specify a numeric value). But it seems like some changes should be introduced to the Rucio's code first. From what I understand, it doesn't handle response 500 correctly here. Instead, I'm getting
dids
like this:[{'ExceptionClass': 'InvalidMetadata', 'ExceptionMessage': 'Database query failed: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type double precision: "b"\n\n[SQL: SELECT dev.did_meta.name AS dev_did_meta_name, dev.did_meta.scope AS dev_did_meta_scope \nFROM dev.did_meta \nWHERE CAST((dev.did_meta.meta ->> %(meta_1)s) AS FLOAT) = %(param_1)s AND dev.did_meta.scope = %(scope_1)s]\n[parameters: {\'meta_1\': \'a\', \'param_1\': 3, \'scope_1\': \'test\'}]\n(Background on this error at: https://sqlalche.me/e/20/9h9h). This can be raised when the datatype of a key is inconsistent between dids.'}]
If this is fixed, then I'd only have to add a couple of lines here to display an error instead of just "No results found".
In principle, I've finished with the code. I'm not sure what else is left to do. I mentioned that the JSON plugin doesn't return some of the attributes, and that I had to add calls to Rucio to retrieve regular metadata for each DID. But, in the end, I don't think I can do anything on the side of the extension, as, for example, I cannot simply ignore the size of the file since it is actually displayed in the panel with the results.
Should I create a pull request?
Yes, @GeorgySk I think you can create a pull request
The DID search in the jupyterlab extension only takes into account three parameters: name, scope and type.
It would be useful to be able to filter the DIDs by user-defined metadata. This could be done by adding an additional "filters" parameter in the DID search and forwarding it to the "filters" argument in the /dids/{scope}/dids/search REST API endpoint that relies on rucio.core.did.list_dids method.
The "filters" argument is not listed in the documentation of the REST API method, but it seems to be there when looking into the underlying code
An effort to implement this feature has been started in this fork