non-printable characters make it into metadata, but the response breaks the XML parser

trel commented 2 months ago

Running the following code...

from irods.session import iRODSSession

with iRODSSession(host='localhost', port=1247, user='rods', password='rods', zone='tempZone') as session:

    hexvalue = "53:6b:6f:70:65:43:61:6c:58:c3:af:c2:bf:c2:bd:c3:af:c2:bf:c2:bd:23:01:32:64:31"
    hex_list = [int(byte, 16) for byte in hexvalue.split(":")]
    string_value = "".join([chr(byte) for byte in hex_list])

    obj = session.data_objects.get('/tempZone/home/rods/file.txt')

    attr_str = "awesome"
    value_str = string_value

    obj.metadata.add( attr_str, value_str )

The AVU makes it in, but the response is not parsed correctly by the PRC.

$ python pyclient.py 
Traceback (most recent call last):
  File "pyclient.py", line 14, in <module>
    obj.metadata.add( attr_str, value_str )
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/data_object.py", line 86, in metadata
    self._meta = iRODSMetaCollection(
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/meta.py", line 89, in __init__
    self._reset_metadata()
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/meta.py", line 92, in _reset_metadata
    self._meta = self._manager.get(self._model_cls, self._path)
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/manager/metadata_manager.py", line 82, in get
    results = self.sess.query(*columns).filter(*conditions).all()
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/query.py", line 212, in all
    result_set = self.execute()
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/query.py", line 198, in execute
    results = result_message.get_main_message(GenQueryResponse)
  File "/home/tgr/venv3/lib/python3.8/site-packages/irods/message/__init__.py", line 400, in get_main_message
    msg.unpack(ET().fromstring(self.msg))
  File "/usr/lib/python3.8/xml/etree/ElementTree.py", line 1320, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 19, column 29

$ imeta ls -d file.txt
AVUs defined for dataObj /tempZone/home/rods/file.txt:
attribute: awesome
value: SkopeCalXÃ¯Â¿Â½Ã¯Â¿Â½#2d1
units:

We should handle this better... but where?

d-w-moore commented 2 months ago

I think if you allow for UTF 8 encoding by iRODS internally, we're fine.

I set my XML parser to QUASI_XML for this experiment:

$ more ~/.python_irodsclient 
data_objects.auto_close True
connections.xml_parser_default "QUASI_XML"
$ PYTHON_IRODSCLIENT_CONFIGURATION_PATH="" python  unprintable_meta.py 
<class 'str'>
v_bytes=[83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]
utf8avu=[83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]

...and unprintable_meta.py is the script below:

#!/usr/bin/env python3
from irods.meta import *
import irods.test.helpers as h

v = b'SkopeCalX\xc3\xaf\xc2\xbf\xc2\xbd\xc3\xaf\xc2\xbf\xc2\xbd#\x012d1'
v_bytes = [83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]
assert bytearray(v_bytes) == v, "original value not correct"

avu = iRODSMeta ('a',v)

s = h.make_session()
dn = '/tempZone/home/rods/100m'  # _-_-_ use any data object here !!!
d = s.data_objects.get(dn)
d.metadata.set(avu)

# --- retrieve / parse for printing
retrieved_avus = [_ for _ in d.metadata.items() if _.name == 'a']
print(type(retrieved_avus[0].value))

utf8avu = [i for i in bytearray(retrieved_avus[0].value.encode('utf-8'))]
#
# returned string won't match ORIGINAL avu value's v_bytes because it is retrieved as a unicode value
#assert([ord(_) for _ in retrieved_avus[0].value] == v_bytes), "retrieved value not intact (unless you assume iRODS prefers UTF8 to straight (Latin?/ascii-bin?) encoding)"

# But, these are equal:
print(f'{v_bytes=}')
print(f'{utf8avu=}')
assert v_bytes == utf8avu

d-w-moore commented 2 months ago

We should handle this better... but where?

If it turns out the UTF-8 thing isn't the issue with the original demo script, I'll look a bit deeper.

d-w-moore commented 2 months ago

I also can't vouch for ElementTree (the default XML parser) doing "the right thing" (at least where the iRODS server is concerned).

At this point ... I think QUASI_XML does better.

d-w-moore commented 2 months ago

The QUASI_XML setting addresses the breaking of the XML parser (ElementTree was never really meant for handling such characters in the PackStruct XML-ish dialect, and do we change packstruct now?).

As to what imeta ls should have dug out of that foiled avu-setting attempt, I'll look into it, as mentioned, but i suspect iRODS trades on UTF-8 rather than ascii binary. So for example, if in the future we would like to store b'\xef' in an AVU field, we won't have very good luck; but we may do better storing Unicode u'\u00ef' in its place - which, one can can see from the printout in my example output, will be encoded as bytes [195,175] by iRODS due to its internal use of UTF-8.

trel commented 2 months ago

I see that QUASI_XML DOES prevent the exception on parsing the response...

$ irm -rf file.txt
$ itouch file.txt
$ imeta ls -d file.txt
AVUs defined for dataObj /tempZone/home/rods/file.txt:
None
$ PYTHON_IRODSCLIENT_CONFIGURATION_PATH="" PYTHON_IRODSCLIENT_CONFIG__CONNECTIONS__XML_PARSER_DEFAULT="'QUASI_XML'" python pyclient.py
WARNING:irods.client_configuration:Config file not available at '/home/trel/.python_irodsclient'
$ imeta ls -d file.txt
AVUs defined for dataObj /tempZone/home/rods/file.txt:
attribute: awesome
value: SkopeCalXÃ¯Â¿Â½Ã¯Â¿Â½#2d1
units:

d-w-moore commented 2 months ago

So then, how do we best address the issue under the default parser?

artificially suppress the exception when the AVU turns out to have made it into the catalog
if the exception happens, catch it, remove the AVU from the catalog, and rethrow
print or log a warning that the metadata operation succeeded despite error, and recommend using QUASI_XML

trel commented 2 months ago

So then, how do we best address the issue under the default parser?

artificially suppress the exception when the AVU turns out to have made it into the catalog

how would we know this / can we? is this a reasonable thing to do? is what's in the database... correct?

if the exception happens, catch it, remove the AVU from the catalog, and rethrow

this feels... better than a)

or... can we catch non-standard / non-utf8 characters on the way into the database for metadata in the first place? and then this issue of parsing a 'bad' response becomes moot?

print or log a warning that the metadata operation succeeded despite error, and recommend using QUASI_XML

this is a good backstop/minimal answer if we can't figure out something more 'active'.

d-w-moore commented 2 months ago

I'll look at these, and whether the AVU entered under exception was correct.

There is a fourth choice, and that is to make QUASI_XML the default parser. This seems to be in line with the server's behavior; if iRODS is entering it into the database, then it is likely also returning a non-error response, which we're then choking on. QUASI_XML was written for the express purpose of being less "standard" XML and more iRODS xml-ish.

d-w-moore commented 2 months ago

Also there are two pathways for the same binary data. Let's consider the four-byte binary string variously expressed as a bytestring b"a\xC3\xAFb", or as the direct mapping via ordinal codes into a Unicode string, (in Python3 this is "a\xC3\xEFb", but it is also expressible in both Py2/Py3 unambiguously as as u"a\u00C3\u00AFb".)

Thus there are two ways to give this data to the AVU .set (or .add) call, expounded on below. Before reading on. please be aware that direct mapping via the ordinal values is not the same as the UTF8 way of doing things. In that encoding, we assume binary data and Unicode strings to be equivalent only if they've passed through a call to encode( ) or decode( ), which may change the length of the string by replacing any higher-ordinal-value Unicode character (integer code >= 0x80) by multiple bytes, that is encoding; and then reversing that transformation in the decode( )call ie. when going back to the Unicode string.

So here are our two options of how to feed the given binary sequence, as promised:

(1) pass in the unicode value, ie transformed via the method used in the issue description script: "".join([chr(i) for i in [0x61,0xc3,0xaf,0x62]]) . (Note the Py3 call chr(i) is equivalent to Py2 unichr(i)). This generates unicode value we mentioned, that is u"a\0u00C3\u00AFb" . Suppose we then hand that Unicode direct-ordinal-value mapping of the binary data to the metadata .set/.add call as the AVU value to be used. What you get back out of the catalog will be an exactly equal, if running in PRC under Python3, because then the PRC prefers to return Unicode. (Under Python2 we'd get the encoded bytes b'a\xc3\x83\xc2\xafb' because that's what PRC under Python2 prefers. Still equivalent though, because the original value we jammed in was the UTF-8 decoding of that string.)

(2) pass in the bytes value mentioned previously: b"a\xC3\xAFb" . If that bytes value is fed to the AVU set( ) or add( ) call, it will be seen as a bytestring and therefore interpreted as equivalent to the three-character unicode sequence: u"a\u00EFb" . What you get back from iRODS and the catalog on a metadata read will then be a bytestring in Python2 or a Unicode string in Python3, but the understood equivalency via str.decode('utf8') and str.encode('utf8') will be preserved.

d-w-moore commented 2 months ago

I mention all of this because it affects what you would consider as equal (or equivalent) in terms of binary data. And in light of it, I've found -- so far -- that PRC under Py3 (as also under Py2) does the right thing in both cases.

d-w-moore commented 2 months ago

We should handle this better... but where?

So then, how do we best address the issue under the default parser?

artificially suppress the exception when the AVU turns out to have made it into the catalog

if the exception happens, catch it, remove the AVU from the catalog, and rethrow

print or log a warning that the metadata operation succeeded despite error, and recommend using QUASI_XML

@trel: If we're forced to make a choice, and making QUASI_XML the universal default seems too risky, consider:

import contextlib,irods.test.helpers
from irods.message import *
@contextlib.contextmanager
def xml_mode(s):
    try:
        ET(getattr(XML_Parser_Type,s)) # change 'default choice' for this thread temporarily (under with block)
        yield
    finally:
        ET(None)  # return to global default for this thread temporarily (under with block)

# Application: 
with irods.test.helpers.make_session() as session:
    # ... # all activity in current thread will be done using STANDARD_XML parser

    with xml_mode('QUASI_XML'):
        # ... # all activity in current thread will be done using QUASI_XML parser 

    # ... # all activity in current thread will be done again using STANDARD_XML parser

I suggest that we provide such a utility function as this xml_mode above, making it part of PRC, which users can avail themselves of in temporary and specialized circumstances (for example setting AVU's with unprintable characters) ....

They can do so, just by using that function in a with - statement.

I think that is the solution I favor.

trel commented 2 months ago

this fourth option, an xml_mode() function... does seem very flexible and good since we do not know how people may prefer to hold this.

I also think that we should consider how we can make add/set possibly only take proper utf-8 strings in the first place.

korydraughn commented 2 months ago

I like the xml_mode() option.

Have we ever ran the full test suite using the QUASI_XML parser? If the QUASI_XML parser is more in line with the iRODS XML encoding, then it sounds like it should be the default.

d-w-moore commented 2 months ago

I like the xml_mode() option.

Have we ever ran the full test suite using the QUASI_XML parser? If the QUASI_XML parser is more in line with the iRODS XML encoding, then it sounds like it should be the default.

I kind of agree about making it the default, though I am inherently more scared of it. Which is why I put forth the xml_mode option.

I don't recall ever running the whole suite under QUASI_XML although I have run meta_test and data_obj_test at the bench, and it performed just fine.

korydraughn commented 2 months ago

Agreed. It makes sense, but we're not going to switch the default for v2.x. However, knowing whether the test suite passes/fails with the QUASI_XML parser would be good to know.

Please add that to your TODO list as a medium priority item and let's get xml_mode implemented.

trel commented 2 months ago

xml_mode is now #586 and will be in 2.x

irods / python-irodsclient

non-printable characters make it into metadata, but the response breaks the XML parser #582