irods / python-irodsclient

A Python API for iRODS

Using anonymous+ticket to get() on multiple files fails where authenticated get() works #313

Open twnone opened 2 years ago

twnone commented 2 years ago

Hello,

While trying anonymous, ticket-based access, I am running into an issue that I cannot reproduce with authenticated access. It shows up when calling get() on multiple files sequentially.

Context

Versions

iRODS server: 4.2.9
python-irodsclient version: commit 05309de

Python client usage

I am using an fsspec implementation for iRODS so that data can be accessed seamlessly from intake catalogs (with xarray and pandas as backends).

In the end, irods-fsspec simply uses the python-irodsclient library :

I created a ticket to allow anonymous user to read data from specific directories :

id: 327917
string: MuS0EO0mKn6jTNx
ticket type: read
obj type: collection
owner name: pillar_ifremer
owner zone: Ifremer
uses count: 0
uses limit: 0
write file count: 0
write file limit: 0
write byte count: 0
write byte limit: 0
expire time: none
collection name: /Ifremer/EOSC/pillar/data
No host restrictions
No user restrictions
No group restrictions
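
For readers unfamiliar with ticket-based access, the following is a minimal sketch of how an anonymous session could be opened and the ticket above supplied to it. It is not the irods-fsspec code. The helper names (anonymous_session_kwargs, open_ticket_session), the host name in the usage comment, and the port 1247 are assumptions for illustration; only iRODSSession and Ticket(...).supply() come from python-irodsclient.

```python
def anonymous_session_kwargs(host, zone, port=1247):
    """Build connection settings for the 'anonymous' user (hypothetical helper).

    The default port 1247 is an assumption; use whatever the server listens on.
    """
    return {
        "host": host,
        "port": port,
        "user": "anonymous",
        "password": "",  # anonymous access uses an empty password
        "zone": zone,
    }


def open_ticket_session(host, zone, ticket_string, port=1247):
    """Open an anonymous session and supply a ticket to it (hypothetical helper)."""
    # Imported lazily so the settings helper above works without the library.
    from irods.session import iRODSSession
    from irods.ticket import Ticket

    session = iRODSSession(**anonymous_session_kwargs(host, zone, port))
    # supply() attaches the ticket to the session for subsequent requests.
    Ticket(session, ticket_string).supply()
    return session


# Usage (host name is a placeholder, not the real server):
# session = open_ticket_session("irods.example.org", "Ifremer", "MuS0EO0mKn6jTNx")
```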

What works

What doesn't work :

The Python client successfully lists the collections included in the dataset in order to identify the files to get(). It then calls get() on each file to read; at this point, python-irodsclient fails with the following exception:

/opt/conda/envs/data-terra/lib/python3.9/site-packages/irods/manager/data_object_manager.py in get(self, path, local_path, num_threads, **options)
     99 
    100     def get(self, path, local_path = None, num_threads = DEFAULT_NUMBER_OF_THREADS, **options):
--> 101         parent = self.sess.collections.get(irods_dirname(path))
    102 
    103         # TODO: optimize

/opt/conda/envs/data-terra/lib/python3.9/site-packages/irods/manager/collection_manager.py in get(self, path)
     27                     continue
     28                 print(f'irods collections manager :: no results found for collection {path} on last iteration')
---> 29                 raise CollectionDoesNotExist()
     30             return iRODSCollection(self, result)
     31 

CollectionDoesNotExist: 

As noted above, the client successfully listed the collections and files by browsing the collections. But when it tries to get() the files while a ticket is supplied to the session, the parent collection cannot be found by the query in CollectionManager.get(), in either of the two attempts it makes.
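
To make the failing step concrete, here is a small sketch of which collection the lookup at line 101 of the traceback targets. It approximates irods_dirname with posixpath.dirname, which is an assumption about its behaviour, and the file name part-0.parquet is hypothetical (the real file names are not shown in this report). The helper names are also made up for this illustration.

```python
import posixpath

# Collection the ticket was issued for (from the ticket listing above).
TICKET_COLLECTION = "/Ifremer/EOSC/pillar/data"


def parent_collection(data_object_path):
    """Return the collection that would be looked up before fetching the object."""
    return posixpath.dirname(data_object_path.rstrip("/"))


def is_within(collection, root):
    """True if `collection` is `root` itself or one of its sub-collections."""
    root = root.rstrip("/")
    return collection == root or collection.startswith(root + "/")


# Hypothetical file name; the directory part matches the server log in this report.
path = "/Ifremer/EOSC/pillar/data/argo/parquet/year=2011/month=5/part-0.parquet"
parent = parent_collection(path)
print(parent)
print(is_within(parent, TICKET_COLLECTION))
```

So the collection being queried is a sub-collection of the one the ticket covers, not the ticket's collection itself.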

I enabled debug logging on the iCAT server, but it produces a lot of information. So far, the only relevant trace I found in the iCAT server log is the following:

Nov 17 11:10:42 pid:4739 DEBUG1: parseXmlValue: XML start tag error for </KeyValPair_PI><InxIvalPair_PI><iiLen>10</iiLen><inx>500</inx><inx>501</inx><inx>502</inx><inx>503</inx><inx>504</inx><inx>505</inx><inx>506</inx><inx>507</inx><inx>508</inx><inx>509</inx><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue><ivalue>1</ivalue></InxIvalPair_PI><InxValPair_PI><isLen>2</isLen><inx>501</inx><inx>401</inx><svalue>= '/Ifremer/EOSC/pillar/data/argo/parquet/year=2011/month=5'</svalue><svalue>&lt;&gt; '0'</svalue></InxValPair_PI></GenQueryInp_PI>, expect <svalue>

This shows that the Python client issued the query condition [DataObject.id != 0], which seems to be malformed?
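
As a reading aid, the conditions in that log line can be decoded with a few lines of Python. This sketch copies only the InxValPair_PI element from the log and pairs each column index with its condition string. The mapping of 401 to D_DATA_ID and 501 to COLL_NAME is my understanding of the GenQuery column numbering and should be treated as an assumption; decode_conditions is a name made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed GenQuery column numbers (not stated in the log itself).
COLUMN_NAMES = {401: "D_DATA_ID", 501: "COLL_NAME"}

# The <InxValPair_PI> element copied from the server log line above.
FRAGMENT = (
    "<InxValPair_PI><isLen>2</isLen><inx>501</inx><inx>401</inx>"
    "<svalue>= '/Ifremer/EOSC/pillar/data/argo/parquet/year=2011/month=5'</svalue>"
    "<svalue>&lt;&gt; '0'</svalue></InxValPair_PI>"
)


def decode_conditions(fragment):
    """Pair each <inx> column number with the <svalue> condition at the same position."""
    root = ET.fromstring(fragment)
    columns = [int(element.text) for element in root.findall("inx")]
    conditions = [element.text for element in root.findall("svalue")]
    return list(zip(columns, conditions))


for column, condition in decode_conditions(FRAGMENT):
    # Fall back to the raw number for columns missing from the assumed mapping.
    print(COLUMN_NAMES.get(column, column), condition)
```

Read this way, the query asks for the collection by name plus the extra condition on the data object id.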

Hopefully my issue is described correctly, as this is not a common usage of python-irodsclient.

Thank you for your help!
Antoine

d-w-moore commented 2 years ago

@twnone I'm curious what happens if you use read() instead of get()?

  for d in c.data_objects:
      content = d.open('r').read()

d-w-moore commented 2 years ago

You are right that a filter of DataObject.id != 0 could be seen as malformed. It was a dummy condition, needed because tickets were corrupting the proper function of GenQuery and including the DataObject column was not sufficient by itself... :/

twnone commented 2 years ago

@twnone I'm curious what happens if you use read() instead of get()?

  for d in c.data_objects:
      content = d.open('r').read()

@d-w-moore: in irods-fsspec, the code responsible for retrieving data objects is the following:

  data_obj = self.session.data_objects.get(path)

That is the get() call that fails with CollectionDoesNotExist while looking up the parent collection. It cannot be replaced with read() because the DataObjectManager class does not implement it.

Thank you for your help