irods / python-irodsclient

A Python API for iRODS

PRC seems to ignore the "irods_default_hash_scheme" in the environment.json #610

Open chStaiger opened 2 months ago

chStaiger commented 2 months ago

While transferring data I noticed that the iRODS server uses different hash schemes for the checksums depending on the client I use.

In my irods_environment.json I set the checksum algorithm as below:

cstaiger@integration:~$ cat .irods/irods_environment.json | grep default_hash_scheme
    "irods_default_hash_scheme": "md5",

On the server sha256 is the default checksum algorithm.

When I use the iCommands to upload data, the checksum is computed with md5:

cstaiger@integration:~$ ils -L hello_iput.txt
  cstaiger     0 irodsResc          12 2024-08-22.05:40 & hello_iput.txt
    6f5902ac237024bdd0c176cb93063dc4    generic    /mnt/irods03/home/.../hello_iput.txt

When I transfer data with PRC v2.0.1, sha2 is used as the checksum algorithm:

>>> import irods.session
>>> sess = irods.session.iRODSSession(irods_env_file = ".irods/irods_environment.json")
>>> sess.data_objects.put("hello.txt", "/nluu12p/home/research-test-christine/hello_prc.txt", **{irods.keywords.REG_CHKSUM_KW: ""})
>>>
cstaiger@integration:~$ ils -L hello_prc.txt
  cstaiger      0 irodsResc           12 2024-08-22.05:48 & hello_prc.txt
    sha2:qUiQTy8PR5uPgZdpSzAYSw0u0cHNKh7A+4XSmaGSpEc=    generic    /mnt/irods03/Vault/home/../hello_prc.txt

Is there an extra parameter which I have to pass to the PRC to ensure that the data is checksummed by md5?

alanking commented 2 months ago

How did you upload the data for the iCommands example? I'm assuming you used iput, but it would be helpful to know which iCommand and options were used.

I see that REG_CHKSUM_KW is being used in the PRC put. I believe that this is equivalent to iput -k, which means...

 -k  checksum - calculate a checksum on the data server-side, and store
       it in the catalog.

That would mean that the checksum only needs to be calculated on the server side, and it would appear that it uses the hash scheme configured for that server.

What you're looking for, I think, is the equivalent of iput -K:

 -K  verify checksum - calculate and verify the checksum on the data, both
       client-side and server-side, and store it in the catalog.

This feature uses VERIFY_CHKSUM_KW to calculate the checksum on the client side, re-calculate it on the server side (using the same hash scheme as the client-side calculation), and then ensure that the two match.

You could try using VERIFY_CHKSUM_KW instead. However, DataObjectManager.put does not appear to implement the client-side checksum calculation like iput. My impression is that you can only register a checksum based on a server-side checksum calculation and there's no built-in way to verify the checksum against the local data.

I'll mark this as a bug, but I view it more as a missing feature rather than something not working. We can play with the labels. :)

@d-w-moore - Does that seem right? Am I missing something?

chStaiger commented 2 months ago

I am sorry, I forgot to copy that command over. Indeed I used:

iput -K hello.txt hello_iput.txt

And the version of the icommands is 4.3.1-0~bionic.

trel commented 2 months ago

In case this is news - there is a little section on checksums in the README...

https://github.com/irods/python-irodsclient?tab=readme-ov-file#computing-and-retrieving-checksums

d-w-moore commented 1 month ago

@trel What's our milestone to be for this one?

korydraughn commented 1 month ago

Let's get the remaining issues for 2.1.1 resolved and handle this in 3.0.

trel commented 1 month ago

Yep

d-w-moore commented 3 days ago

I guess it makes sense for us to respect irods_match_hash_policy as well.

korydraughn commented 3 days ago

Let's discush first.

d-w-moore commented 3 days ago

For pre-consideration in discush: I noticed iput has both -K (affected by the client's default hash scheme) and -k (not affected), whereas istream has only -k. This doesn't mean much to me, except perhaps that the Python iRODS Client "put", being an open/write/close, may, like istream write, have different potential capabilities than an iput. FWIW....

d-w-moore commented 3 days ago

ichksum has -K, so that and the data object .chksum() method will probably be more our point of reference, I would hazard a guess.

korydraughn commented 3 days ago

@chStaiger After some discussion, we landed at the following ...

In your original issue, you're comparing iput to PRC put. iput uses the PUT API whereas the PRC put uses open/write/close (i.e. streaming operations). The streaming operations do not support client-side checksum operations like iput.

You'd need to provide your own implementation for the behavior you're describing.
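
For anyone landing here, a minimal sketch of what such a do-it-yourself check could look like on the client side. It only uses the Python standard library and is not PRC code. The function names (`local_checksum`, `scheme_of`, `matches_catalog`) are hypothetical, and the two checksum formats are assumptions inferred from the `ils -L` output earlier in this thread: a bare hex digest for md5, and `sha2:` followed by the base64-encoded SHA-256 digest for sha2.

```python
import base64
import hashlib


def local_checksum(path, scheme="md5", chunk_size=1024 * 1024):
    """Hash a local file and format the result like the catalog values above.

    Assumed formats (inferred from the `ils -L` output in this thread):
      md5  -> plain hex digest
      sha2 -> "sha2:" + base64 of the raw SHA-256 digest
    """
    if scheme == "md5":
        digest = hashlib.md5()
    elif scheme == "sha2":
        digest = hashlib.sha256()
    else:
        raise ValueError("unsupported hash scheme: %r" % (scheme,))

    # Read in chunks so large files do not have to fit in memory.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)

    if scheme == "md5":
        return digest.hexdigest()
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def scheme_of(catalog_checksum):
    """Guess the scheme from a catalog checksum string (hypothetical helper)."""
    return "sha2" if catalog_checksum.startswith("sha2:") else "md5"


def matches_catalog(path, catalog_checksum):
    """Recompute the local checksum with the server's scheme and compare."""
    return local_checksum(path, scheme_of(catalog_checksum)) == catalog_checksum
```

After a PRC put with REG_CHKSUM_KW, the catalog value could be taken from `ils -L` or from the data object's .chksum() method mentioned above (assuming it returns the catalog string), and then passed to `matches_catalog("hello.txt", value)`. Note that this only verifies the upload against whatever scheme the server chose; it does not make the server store an md5 checksum.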