irods-contrib / metalnx-web

Metalnx Web Application
https://metalnx.github.io/
BSD 3-Clause "New" or "Revised" License

downloading multiple files hangs with DIRECT_CHILD_ACCESS when compound resource involved #331

Closed trel closed 2 months ago

trel commented 1 year ago

Bug

iRODS 4.3.0 with Metalnx 2.6.0

Resource hierarchy:

$ ilsresc
defaultResc:passthru
└── compHCP:compound
    ├── archiveHCP:unixfilesystem
    └── demoResc:unixfilesystem

Two data objects, each with two replicas:

$ ils -L
/tempZone/home/rods:
  rods              0 defaultResc;compHCP;demoResc          284 2022-11-16.06:10 & bananas
        generic    /var/lib/irods/Vault/home/rods/bananas
  rods              1 defaultResc;compHCP;archiveHCP          284 2022-11-16.06:10 & bananas
        generic    /tmp/archiveHCP/home/rods/bananas
  rods              0 defaultResc;compHCP;demoResc          239 2022-11-16.06:10 & foo
        generic    /var/lib/irods/Vault/home/rods/foo
  rods              1 defaultResc;compHCP;archiveHCP          239 2022-11-16.06:10 & foo
        generic    /tmp/archiveHCP/home/rods/foo

In Metalnx, selecting both data objects and clicking Action:Download results in a hang of the web application.

With this in the iRODS log:

irods-catalog-provider_1  | {"log_category":"legacy","log_facility":"local0","log_level":"error","log_message":"[-]\t/irods_source/server/core/src/irods_resource_redirect.cpp:130:irods::error (anonymous namespace)::request_vote_for_file_object(rsComm_t *, const std::string &, const std::string &, irods::file_object_ptr, std::string &, float &) :  status [DIRECT_CHILD_ACCESS]  errno [] -- message [attempt to directly address a child resource]\n\n","request_api_name":"DATA_OBJ_COPY_AN","request_api_number":696,"request_api_version":"d","request_client_user":"rods","request_host":"10.15.21.8","request_proxy_user":"rods","request_release_version":"rods3.2","server_host":"2e8df3fb41cc","server_pid":47,"server_timestamp":"2022-11-16T06:12:54.720Z","server_type":"agent"}
irods-catalog-provider_1  | {"log_category":"legacy","log_facility":"local0","log_level":"error","log_message":"[rsDataObjOpen_impl:904] - [DIRECT_CHILD_ACCESS: [-]\t/irods_source/server/core/src/irods_resource_redirect.cpp:130:irods::error (anonymous namespace)::request_vote_for_file_object(rsComm_t *, const std::string &, const std::string &, irods::file_object_ptr, std::string &, float &) :  status [DIRECT_CHILD_ACCESS]  errno [] -- message [attempt to directly address a child resource]\n\n\n\n] [error_code=[-1816000], path=[/tempZone/home/rods/.jargonZipService/zipServiceBundle-1668579174605-963821533/bananas], hierarchy=[]","request_api_name":"DATA_OBJ_COPY_AN","request_api_number":696,"request_api_version":"d","request_client_user":"rods","request_host":"10.15.21.8","request_proxy_user":"rods","request_release_version":"rods3.2","server_host":"2e8df3fb41cc","server_pid":47,"server_timestamp":"2022-11-16T06:12:54.720Z","server_type":"agent"}
irods-catalog-provider_1  | {"log_category":"legacy","log_facility":"local0","log_level":"error","log_message":"[open_destination_data_obj:144] - failed to open destination object [irods error=[DIRECT_CHILD_ACCESS], system error=[], path=[/tempZone/home/rods/.jargonZipService/zipServiceBundle-1668579174605-963821533/bananas]]","request_api_name":"DATA_OBJ_COPY_AN","request_api_number":696,"request_api_version":"d","request_client_user":"rods","request_host":"10.15.21.8","request_proxy_user":"rods","request_release_version":"rods3.2","server_host":"2e8df3fb41cc","server_pid":47,"server_timestamp":"2022-11-16T06:12:54.720Z","server_type":"agent"}

And this in the metalnx logs:

metalnx_1                 | 2022-11-16 06:12:54 ERROR DataObjectAOImpl:2530 - error copying irods file
metalnx_1                 | org.irods.jargon.core.exception.ResourceHierarchyException: DIRECT_CHILD_ACCESS
metalnx_1                 |     at org.irods.jargon.core.connection.IRODSErrorScanner.checkSpecificCodesAndThrowIfExceptionLocated(IRODSErrorScanner.java:239)
metalnx_1                 |     at org.irods.jargon.core.connection.IRODSErrorScanner.inspectAndThrowIfNeeded(IRODSErrorScanner.java:115)
metalnx_1                 |     at org.irods.jargon.core.connection.IRODSMidLevelProtocol.processMessageInfoLessThanZero(IRODSMidLevelProtocol.java:1542)

trel commented 1 year ago

This has also been reported with a 4.2.11 iRODS server.

I believe the bug is in the zipBundleService not using the correct target resource when gathering the data objects before zipping them. Additional evidence that this is the problem is an empty/unpopulated zipBundle collection remaining in the user's home collection after the error is encountered.

trel commented 1 year ago

This may require a more thorough effort than just fixing a variable somewhere.

The zipBundleService gathers the files - but it doesn't have a good non-hierarchy place to put them before zipping.

Initial thinking is a new metalnx.property specific to this gathering step: a place to write things down temporarily. That doesn't feel right, though, since it is a client-side 'workaround' and would need enough space to support multiple big files being requested at once. Will continue to brainstorm.

kovid20 commented 4 months ago

@trel Please let us know the timeline for fixing this issue in an upcoming release of Metalnx, as our application users are facing issues when downloading multiple files from the front end. If a temporary workaround is available, please let us know.

trel commented 4 months ago

Hi @kovid20 - there is not currently a timeline for fixing this issue. The cause is a fundamental assumption in Metalnx of being able to talk directly to the resource where the replica lives. Other clients do not make this assumption, which is why it has not appeared elsewhere.

A workaround for you is to not have a compound resource. As I understand it, the HCP device is a Linux mountpoint, and could be addressed directly, rather than as a child of a compound.

I would recommend...

$ ilsresc
archiveHCP:unixfilesystem
demoResc:unixfilesystem

Then, I believe the zipBundleService will no longer throw a DIRECT_CHILD_ACCESS error and you should be able to download multiple files as you are expecting.

trel commented 4 months ago

I am failing to reproduce this today with iRODS 4.3.1 and Metalnx 2.6.1.

trel commented 4 months ago

Now failing to reproduce this with iRODS 4.3.0 and Metalnx 2.6.0 - same versions as originally reported.

trel commented 4 months ago

A breakthrough!

I have been able to reproduce the DIRECT_CHILD_ACCESS error - but only when the server's default resource is in a hierarchy as a child. The problem is not that Metalnx is trying to READ from a child directly, it's that it is trying to WRITE to a child directly.

Metalnx is using Jargon, the Java iRODS client library, and is issuing copy operations to stage the multiple files to the same logical and physical place before zipping them and sending the zip to the user. It then cleans up the staging area.
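The stage, zip, and clean-up flow described above can be sketched on a local filesystem. This is only an analogy in Python using the standard library, not Metalnx or Jargon code; the function name `bundle` and the `zipServiceBundle-` prefix on the scratch directory are chosen for illustration. The copy in step 1 corresponds to the write that fails with DIRECT_CHILD_ACCESS in the iRODS case.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def bundle(paths, zip_path):
    """Stage the given files in one scratch directory, zip them, then clean up."""
    # Hypothetical staging area; Metalnx stages inside an iRODS collection instead.
    staging = Path(tempfile.mkdtemp(prefix="zipServiceBundle-"))
    try:
        # Step 1: copy every selected file into the staging area.
        staged = [Path(shutil.copy(p, staging)) for p in paths]
        # Step 2: zip the staged copies under their base names.
        with zipfile.ZipFile(zip_path, "w") as zf:
            for f in staged:
                zf.write(f, arcname=f.name)
    finally:
        # Step 3: remove the staging area, even if a copy or the zip step failed.
        shutil.rmtree(staging, ignore_errors=True)
    return zip_path
```

Cleaning up in a `finally` block is a choice made in this sketch; as noted earlier in the thread, the real service leaves an empty zipBundle collection behind when the copy fails.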

If the server handling the connection has a default resource in a hierarchy, then the copy destination is determined by that default resource and the copy operation fails with DIRECT_CHILD_ACCESS, which is correct.

The workaround is to have the default resource defined by the iRODS server (in core.re) set to a resource that is NOT in a hierarchy (this is true in general, not just for this issue - default should never be IN a hierarchy as a child).

A better long-term solution is to provide a new option to Metalnx that defines the location for the zipService to target while staging the zip operation. Created an issue for that here... https://github.com/irods-contrib/metalnx-web/issues/356

kovid20 commented 2 months ago

Hi Terrell, could you help me with how to set the parent context of a storage resource?

trel commented 2 months ago

That field is populated by the iadmin addchildtoresc command...

$ iadmin addchildtoresc theparent thechild the_relationship

$ ilsresc -l thechild | grep "parent "
parent context: the_relationship

kovid20 commented 2 months ago

Hi Terrell,

If we make changes in metalnx.properties (front end) without touching the iRODS hierarchy, is there any possibility that multiple downloads will work?

trel commented 2 months ago

No. Multiple downloads require bundling - which requires a place to gather (write) the multiple files before zipping them together.

If the server handling the connection has a default resource in a hierarchy, then the copy destination is determined by that default resource and the copy operation fails with DIRECT_CHILD_ACCESS, which is correct.

The workaround is to have the default resource defined by the iRODS server (in core.re) set to a resource that is NOT in a hierarchy (this is true in general, not just for this issue - default should never be IN a hierarchy as a child).

Having a default resource in a hierarchy is a misconfiguration of the iRODS server.

Change your hierarchy and the problem is solved.

kovid20 commented 2 months ago

Hi Terrell, thanks for your response. Just to avoid confusion, I wanted to ask: we have demoResc set in the core.re file

acSetRescSchemeForCreate {msiSetDefaultResc("demoResc","null"); }
acSetRescSchemeForRepl {msiSetDefaultResc("demoResc","null"); }

and the output of ilsresc is:

defaultResc:passthru
└── compHCP:compound
    ├── archiveHCP:s3
    └── demoResc:unixfilesystem
s3resc:s3

In your view, which resource should be moved out of the hierarchy?

trel commented 2 months ago

You need to move demoResc out of the hierarchy (because it is the one defined as the default via msiSetDefaultResc).

Being in the hierarchy is the problem when Metalnx requests a write while it is gathering multiple files together.
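The same check can be sketched in Python against the configuration quoted above. This is a rough illustration, not an iRODS tool: the helper names `child_resources` and `default_resources` are made up here, and the parsing assumes that `ilsresc` draws children with `├──` / `└──` branches and that core.re names the default resource as the first quoted argument to msiSetDefaultResc, as in the excerpts in this thread.

```python
import re


def child_resources(ilsresc_output):
    """Return names of resources that appear below a parent in `ilsresc` tree output."""
    children = set()
    for line in ilsresc_output.splitlines():
        # Child lines are drawn with a tree branch followed by "name:type".
        match = re.search(r"[├└]── ([^:\s]+):", line)
        if match:
            children.add(match.group(1))
    return children


def default_resources(core_re_text):
    """Return the resource names passed as the first argument to msiSetDefaultResc."""
    return set(re.findall(r'msiSetDefaultResc\("([^"]+)"', core_re_text))


# Configuration as quoted in the thread.
ILSRESC = """\
defaultResc:passthru
└── compHCP:compound
    ├── archiveHCP:s3
    └── demoResc:unixfilesystem
s3resc:s3
"""

CORE_RE = """\
acSetRescSchemeForCreate {msiSetDefaultResc("demoResc","null"); }
acSetRescSchemeForRepl {msiSetDefaultResc("demoResc","null"); }
"""

# Default resources that are also children of some parent are the misconfiguration.
misplaced = default_resources(CORE_RE) & child_resources(ILSRESC)
print(sorted(misplaced))  # ['demoResc']
```

An empty result would mean no default resource sits inside a hierarchy as a child, which is the state the workaround aims for.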