irods / python-irodsclient

A Python API for iRODS

PRC seems to ignore the "irods_default_hash_scheme" in the environment.json #610

Open chStaiger opened 3 weeks ago

chStaiger commented 3 weeks ago

While transferring data I noticed that the iRODS server uses different hash schemes for the checksums depending on the client I use.

In my irods_environment.json I set the checksum algorithm as below:

cstaiger@integration:~$ cat .irods/irods_environment.json | grep default_hash_scheme
    "irods_default_hash_scheme": "md5",

On the server, sha256 is the default checksum algorithm.

When I use the iCommands to upload data, the checksum is stored as an md5 sum:

cstaiger@integration:~$ ils -L hello_iput.txt
  cstaiger     0 irodsResc          12 2024-08-22.05:40 & hello_iput.txt
    6f5902ac237024bdd0c176cb93063dc4    generic    /mnt/irods03/home/.../hello_iput.txt

When I transfer data with PRC v2.0.1, sha2 is used as the checksum algorithm:

>>> import irods.session
>>> sess = irods.session.iRODSSession(irods_env_file = ".irods/irods_environment.json")
>>> sess.data_objects.put("hello.txt", "/nluu12p/home/research-test-christine/hello_prc.txt", **{irods.keywords.REG_CHKSUM_KW: ""})
>>>
cstaiger@integration:~$ ils -L hello_prc.txt
  cstaiger      0 irodsResc           12 2024-08-22.05:48 & hello_prc.txt
    sha2:qUiQTy8PR5uPgZdpSzAYSw0u0cHNKh7A+4XSmaGSpEc=    generic    /mnt/irods03/Vault/home/../hello_prc.txt

Is there an extra parameter which I have to pass to the PRC to ensure that the data is checksummed by md5?

alanking commented 3 weeks ago

How did you upload the data for the iCommands example? I'm assuming you used iput, but it would be helpful to know which iCommand and options were used.

I see that REG_CHKSUM_KW is being used in the PRC put. I believe that this is equivalent to iput -k, which means...

 -k  checksum - calculate a checksum on the data server-side, and store
       it in the catalog.

That would mean that the checksum only needs to be calculated on the server side, and it would appear that it uses the hash scheme configured for that server.

What you're looking for, I think, is the equivalent of iput -K:

 -K  verify checksum - calculate and verify the checksum on the data, both
       client-side and server-side, and store it in the catalog.

This feature uses VERIFY_CHKSUM_KW to calculate the checksum on the client side, re-calculate it on the server side (using the same hash scheme as the client-side calculation), and then ensure that the two match.

You could try using VERIFY_CHKSUM_KW instead. However, DataObjectManager.put does not appear to implement the client-side checksum calculation like iput. My impression is that you can only register a checksum based on a server-side checksum calculation and there's no built-in way to verify the checksum against the local data.
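Until something like that exists, a client-side check can be approximated by hand. The following is only a minimal sketch, not part of the PRC: the helper names `irods_style_checksum` and `matches_catalog_checksum` are hypothetical, and it assumes the two checksum formats visible in the `ils -L` output earlier in this thread (a bare hex digest for md5, and `sha2:` followed by the base64-encoded SHA-256 digest).

```python
import base64
import hashlib


def irods_style_checksum(path, scheme="sha2", chunk_size=1024 * 1024):
    """Checksum a local file and format it the way `ils -L` displays it.

    Hypothetical helper, not a PRC function. Supports only the two
    schemes that appear in this thread: "md5" and "sha2".
    """
    if scheme == "md5":
        digest = hashlib.md5()
    elif scheme == "sha2":
        digest = hashlib.sha256()
    else:
        raise ValueError("unsupported hash scheme: %r" % (scheme,))

    # Read in chunks so large files do not have to fit in memory.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)

    if scheme == "md5":
        return digest.hexdigest()
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def matches_catalog_checksum(path, catalog_checksum):
    """Compare a local file against a checksum string from the catalog.

    The scheme is inferred from the string: a "sha2:" prefix means
    SHA-256, anything else is treated as md5 (an assumption).
    """
    scheme = "sha2" if catalog_checksum.startswith("sha2:") else "md5"
    return irods_style_checksum(path, scheme) == catalog_checksum
```

The `catalog_checksum` argument would be whatever the server has registered for the data object, for example the value printed by `ils -L`. Inferring the scheme from the registered value means the comparison works whichever hash scheme the server chose.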

I'll mark this as a bug, but I view it more as a missing feature rather than something not working. We can play with the labels. :)

@d-w-moore - Does that seem right? Am I missing something?

chStaiger commented 3 weeks ago

I am sorry, I forgot to copy that command over. Indeed I used:

iput -K hello.txt hello_iput.txt

And the version of the icommands is 4.3.1-0~bionic.

trel commented 3 weeks ago

In case this is news - there is a little section on checksums in the README...

https://github.com/irods/python-irodsclient?tab=readme-ov-file#computing-and-retrieving-checksums