irods-contrib / metalnx-web

Metalnx Web Application
https://metalnx.github.io/
BSD 3-Clause "New" or "Revised" License

Download of multiple files fails when a file is only on an archive resource under compound #222

Closed (kuntzagk closed this issue 2 years ago)

kuntzagk commented 3 years ago

When we mark multiple files in the collection browser and select "Download", it hangs at "Preparing files for Download".

(screenshot: metalnx_multiple_download_error)

In the Metalnx log (attached) we see that it fails to copy the files into the temporary jargonZipService bundle, with a "DIRECT_CHILD_ACCESS" error.

metalnx_multiple_download_error.log

This is probably caused by files residing on a compound resource.

trel commented 3 years ago

You are probably correct...

Can you share the output of ilsresc?

And what is the default resource defined on the iRODS Server that this Metalnx is connected to?

Are all the files being gathered for zip... on the archive? on the cache? on other resources?
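
For reference, a quick way to check the default resource on both sides, assuming a standard iRODS 4.2 layout (key names are the usual ones but worth confirming per deployment):

$ ienv | grep irods_default_resource                         # client-side default, if one is set
$ grep default_resource_name /etc/irods/server_config.json   # server-side default, on the catalog provider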

trel commented 3 years ago

Confirmed - downloading works if the data object managed by a compound resource has a replica on the 'cache' resource.

If the data object only has a replica in the 'archive', the above failure is seen.

In 4-2-stable... the error is a bit more specific:

Apr 22 12:51:59 pid:1284 remote addresses: 192.168.224.2, ::1 ERROR: iRODS Exception:
    file: ../server/api/src/rsDataObjOpen.cpp
    function: int (anonymous namespace)::rsDataObjOpen_impl(rsComm_t *, dataObjInp_t *)
    line: 937
    code: -164000 (SYS_REPLICA_DOES_NOT_EXIST)
    message:
        [rsDataObjOpen_impl] - no replica found for [/tempZone/home/rods/compfile] on [compResc;cacheResc]
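
For context, a minimal icommand sketch of how a data object reaches this archive-only state (resource names follow the pt;compResc;cacheResc example; the archive child is assumed here to be named archiveResc; flags are typical for 4.2 and worth double-checking):

$ iput -R pt compfile          # the compound creates a cache replica and an archive replica
$ ils -L compfile              # note which replica number landed on cacheResc (usually 0)
$ itrim -N 1 -n 0 compfile     # trim that cache replica so only the archive copy remains
$ ils -L compfile              # the only remaining replica is now under ...;compResc;archiveResc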

The solution will be to have metalnx and/or jargon not specify which replica to put into the jargonZipService - and to let the server decide via its internal voting mechanism.
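
As a point of comparison, the icommands already work this way: an iget that does not pin a resource or replica lets the server's voting resolve to the compound, which stages the archive copy back to cache and serves the read. A sketch with the same illustrative names (this is the 4.2.8 behavior, i.e. without the server regression noted at the end of this thread):

$ iget /tempZone/home/rods/compfile /tmp/compfile   # no -R / -n: voting picks the compound, which stages archive -> cache
$ ils -L /tempZone/home/rods/compfile               # a fresh replica now exists on cacheResc alongside the archive one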

trel commented 3 years ago

Even if the only file downloaded is the data object with a replica only on the 'archive' resource (no replica on 'cache'), Metalnx sends a 500 error. If this simpler case is fixed, then the multi-file download will probably begin to work as well.

kuntzagk commented 3 years ago

The two files in my test ("test1" and "testbha") are both only on the cache resource. The server version here is 4.2.8.

Here is the ilsresc and ils -L output for the collection:

[mywtl@by0uez ~]$ ils -L /biomarkerPocZone/test
/biomarkerPocZone/test:
  sgpeo             1 defaultResc;compHCP;archiveHCP     49853441 2019-01-29.11:26 & charles-proxy-win64-3.8.3.msi
    sha2:WvHGlL3BAYw+5cdTrdddY1dNXyEkj6VoNiP005Ovdxk=    generic    /bmda-dev/12001/charles-proxy-win64-3.8.3.msi
  sgpeo             2 defaultResc;compHCP;demoResc     49853441 2019-02-26.09:20 & charles-proxy-win64-3.8.3.msi
    sha2:WvHGlL3BAYw+5cdTrdddY1dNXyEkj6VoNiP005Ovdxk=    generic    /irods/vault/test/charles-proxy-win64-3.8.3.msi
  eulkd             0 defaultResc;compHCP;demoResc           25 2020-10-06.15:37 & test1
    sha2:UIunBVFk733YWLea+Rack0I7fJa+PYi1pLw3uSeQQnU=    generic    /irods/vault/test/test1
  eulkd             0 defaultResc;compHCP;demoResc           25 2020-10-06.15:42 & testbha
    sha2:UIunBVFk733YWLea+Rack0I7fJa+PYi1pLw3uSeQQnU=    generic    /irods/vault/test/testbha
[mywtl@by0uez ~]$ ilsresc
defaultResc:passthru
└── compHCP:compound
    ├── archiveHCP:s3
    └── demoResc:unixfilesystem

trel commented 3 years ago

Hmm - I have only reproduced this when the 'cache' replica does not exist.

With this...

$ ils -L compfile1 compfile2
  rods              0 pt;compResc;cacheResc          166 2021-04-23.09:19 & compfile1
        generic    /tmp/cacheRescVault/home/rods/compfile1
  rods              0 pt;compResc;cacheResc       734392 2021-04-23.09:18 & compfile2
        generic    /tmp/cacheRescVault/home/rods/compfile2

compfile1 and compfile2 both download fine individually... AND together as a multi-file download using the jargonZipService bundle file.

Regardless, without a 'cache' file, there is a problem, and we'll address that first.

korydraughn commented 2 years ago

So far, I've only been able to reproduce this issue when the cache replica does not exist.

trel commented 2 years ago

> The solution will be to have metalnx and/or jargon not specify which replica to put into the jargonZipService - and to let the server decide via its internal voting mechanism.

@korydraughn I think this is the 'fix'.

korydraughn commented 2 years ago

Right, I'm seeing at least one reference to a replNumber=0 in the log output. I also noticed that right after the failure, a new replica appears on the cache resource.

trel commented 2 years ago

Ah, so perhaps just a little order-of-operations and we're back in business...

korydraughn commented 2 years ago

Looking at the Metalnx/Jargon code so far, I'm not seeing where a target resource/replica is being specified, and based on the log output you can see that Jargon is defaulting to the default resource.

The error occurs here (from the log output): https://github.com/DICE-UNC/jargon/blob/4d36076b5567247ae40d02f0c27d90ded5bf8d41/jargon-core/src/main/java/org/irods/jargon/core/pub/DataObjectAOImpl.java#L2528

Within that function, we see a log message that prints the target resource: https://github.com/DICE-UNC/jargon/blob/4d36076b5567247ae40d02f0c27d90ded5bf8d41/jargon-core/src/main/java/org/irods/jargon/core/pub/DataObjectAOImpl.java#L2476

The output for the target resource is empty. The rodsLog output also mentions that the target replica does not exist. In this case, the target replica is the one on the cache resource.

I feel this is a situation where Jargon needs to pick the latest good replica. If it is doing that, I haven't spotted it yet. I'll continue to investigate and post my findings here.
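
For what it's worth, the catalog already holds everything needed to choose a "latest good replica". A GenQuery sketch of the selection (column names as used in 4.2; object names reuse the earlier example and are illustrative only):

$ iquest "select DATA_REPL_NUM, DATA_RESC_NAME, DATA_REPL_STATUS, DATA_MODIFY_TIME where COLL_NAME = '/tempZone/home/rods' and DATA_NAME = 'compfile'"
# DATA_REPL_STATUS = 1 corresponds to the '&' (good) flag in ils -L; the newest DATA_MODIFY_TIME among those rows is the latest good replica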

trel commented 2 years ago

so, perhaps... the zip service should 'ensure' or 'confirm' that there is a replica in a cache before attempting the 'read'? or does that cross too many intended-to-be-separate pieces...
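
If the zip service did take that route, the pre-flight would amount to an existence check plus an un-pinned read that triggers the compound's stage-to-cache. Sketched in icommand terms (illustrative names again; absent the 4.2.9 server regression noted at the end of this thread):

$ iquest "select DATA_REPL_NUM where DATA_NAME = 'compfile' and DATA_RESC_NAME = 'cacheResc'"
# if no rows come back, a read that lets the server vote will stage archive -> cache before serving bytes
$ iget -f /tempZone/home/rods/compfile /dev/null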

trel commented 2 years ago

Works with 4.2.8.

Moving this issue to irods/irods as this is a server bug - demonstrated in 4.2.9 with just icp.

Edit: cannot transfer since irods/irods is in a different GitHub organization. Closing and will link here from a new issue in irods/irods.
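
For the record, the "just icp" reproduction is essentially the archive-only state from the earlier sketch plus a copy (illustrative names; on 4.2.9 the copy fails at open, presumably with the same SYS_REPLICA_DOES_NOT_EXIST shown above, while on 4.2.8 it succeeds):

$ iput -R pt compfile && itrim -N 1 -n 0 compfile                      # leave only the archive replica
$ icp /tempZone/home/rods/compfile /tempZone/home/rods/compfile.copy   # fails on 4.2.9, works on 4.2.8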