pingidentity / ldapsdk

UnboundID LDAP SDK for Java

ColumnFormatter cannot handle special national characters #89

Open pbando opened 4 years ago

pbando commented 4 years ago

When an attribute contains language-specific characters (ü, ö, á, etc.) and ldapsearch is run with --outputFormat csv, those characters are stripped out. For example:
Using CSV as the output format: Çavuşoğlu -> avuolu, Ülkütan -> lktan. Using JSON as the output format: Çavuşoğlu -> ăavu, Ülkütan -> lkŘtan.

This makes --outputFormat useless for anyone who needs to store users' names in countries that use such characters (i.e. almost all EU countries).

dirmgr commented 4 years ago

I’ve confirmed that there is a bug in the way that the LDAP SDK generates CSV-formatted values. It was inadvertently dropping any character outside the printable range of ASCII characters (that is, less than space or greater than tilde). I’ve just committed a fix for this issue, and those characters will now all be preserved (although a string containing any such character will be enclosed in double quotes).

I also discovered a bug in the way that it was handling strings that included the double quote character. It had previously escaped the double quote by preceding it with a backslash (so a single double quote character was formatted as “\"”), but RFC 4180 states that it should be escaped by preceding it with a double quote character (like “""”). This has also been fixed, although it is possible to switch back to backslash-based escaping if necessary for backward compatibility.
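As a rough illustration of the quoting rules described above, here is a minimal sketch. It is not the LDAP SDK's actual code: the class name CsvFieldSketch and the method quote are hypothetical, and the rule that a comma also forces quoting comes from RFC 4180 rather than from this thread.

```java
final class CsvFieldSketch {

    // Hypothetical helper, not the LDAP SDK's implementation.
    // Returns the value unchanged when it is plain printable ASCII with no
    // comma or double quote. Otherwise it wraps the value in double quotes
    // and doubles any embedded double quote, as RFC 4180 describes.
    static String quote(String value) {
        boolean needsQuotes = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            // Anything below space or above tilde is kept, but forces quoting.
            if (c < ' ' || c > '~' || c == ',' || c == '"') {
                needsQuotes = true;
                break;
            }
        }
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quote("plain"));       // prints: plain
        System.out.println(quote("a,b"));         // prints: "a,b"
        System.out.println(quote("say \"hi\""));  // prints: "say ""hi"""
    }
}
```

The key point is that non-ASCII characters are passed through untouched; only the surrounding quotes and the doubled double quote change the field.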

However, my testing did not uncover any issues with JSON handling of non-ASCII characters, and there are already tests in place to ensure that such characters are handled properly. The behavior that you’re seeing, in which some characters are swapped out for others, suggests that it’s an issue with the way that strings are being encoded. LDAP expects strings to be encoded in UTF-8, but it may be that the client that stored those values used a different encoding. What does the ldapsearch output look like when using the default LDIF output format?

pbando commented 4 years ago

Thank you for the quick response. With the default LDIF output format, ldapsearch returns the value base64-encoded: username:: w4dhdnXFn2/En2x1

Note that the following code:

Attribute values = entry.getAttribute(attribute);
String listString = String.join("|", values.getValues());

properly gives back the string, and listString is correct. Server used: PingDirectory.

Are there any plans to support returning all values of a multivalued attribute in ColumnFormatterLDAPSearchOutputHandler, e.g. joined with pipes as I do above, rather than only the first one?

dirmgr commented 4 years ago

That’s the correct base64-encoding for the value, so it shouldn’t be a character set issue.
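For anyone who wants to repeat that check, a minimal sketch using only the standard Java library is below. The class and method names are made up for illustration. It decodes the base64 value as UTF-8 and then re-encodes it; if the bytes were not valid UTF-8, the round trip would not reproduce them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class Base64ValueSketch {

    // Decodes a base64 LDIF value and interprets the bytes as UTF-8.
    static String decodeUtf8(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Returns true when the decoded bytes survive a UTF-8 round trip.
    // Malformed input is replaced during decoding, so it would not match.
    static boolean isValidUtf8(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        byte[] roundTrip = new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, roundTrip);
    }

    public static void main(String[] args) {
        String value = "w4dhdnXFn2/En2x1";
        System.out.println(isValidUtf8(value));          // prints: true
        System.out.println(decodeUtf8(value).length());  // prints: 9
    }
}
```

The decoded string is the nine-character name Çavuşoğlu. Whether it displays correctly when printed also depends on the console's own character set, which is why the sketch prints the length rather than the name.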

When I use the 8.0.0.1 version of the Ping Identity Directory Server (the latest generally available version) and the version of ldapsearch that it provides, I get what I expect to be the correct behavior for the JSON-formatted output.

Here’s the LDIF-formatted output for the entry that I’m using (minus the comments that try to provide a human-readable representation of the base64-encoded value):

dn: uid=\c3\a7avu\c5\9fo\c4\9flu.\c3\bclk\c3\bctan,ou=People,dc=example,dc=com
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetOrgPerson
sn:: w5xsa8O8dGFu
cn:: w4dhdnXFn2/En2x1IMOcbGvDvHRhbg==
givenName:: w4dhdnXFn2/En2x1
uid:: w6dhdnXFn2/En2x1LsO8bGvDvHRhbg==

And here’s the JSON-formatted output:

{ "result-type":"entry",
  "dn":"uid=\\c3\\a7avu\\c5\\9fo\\c4\\9flu.\\c3\\bclk\\c3\\bctan,ou=People,dc=example,dc=com",
  "attributes":[ { "name":"objectClass",
                   "values":[ "top",
                              "person",
                              "organizationalPerson",
                              "inetOrgPerson" ] },
                 { "name":"sn",
                   "values":[ "Ülkütan" ] },
                 { "name":"cn",
                   "values":[ "Çavuşoğlu Ülkütan" ] },
                 { "name":"givenName",
                   "values":[ "Çavuşoğlu" ] },
                 { "name":"uid",
                   "values":[ "çavuşoğlu.ülkütan" ] } ] }

If your JSON-formatted output doesn’t appear to be correct, then please provide me with a clear example that demonstrates the problem. It would probably be most helpful if you could provide two versions of the ldapsearch output: one with the LDIF output format and one with the JSON format.
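As an aside on the DN shown in both outputs above: the hex pairs such as \c3\a7 are simply the UTF-8 bytes of the same characters that appear in the uid value, written in escaped form (in the JSON output each backslash is additionally doubled by JSON string escaping). The sketch below shows one way to turn the LDIF form back into readable text. It is only an illustration with a hypothetical helper name, not how the LDAP SDK parses DNs, and it deliberately leaves non-hex escapes alone.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class DnHexEscapeSketch {

    // Hypothetical helper for illustration only. It decodes backslash plus
    // two hex digits into one byte and copies everything else through, so
    // other escape forms (for example an escaped comma) keep their backslash.
    static String unescapeHexPairs(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = (i + 2 < s.length())
                        ? Character.digit(s.charAt(i + 2), 16) : -1;
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 3;
                    continue;
                }
                // Some other escape: keep the backslash and let the escaped
                // character be copied verbatim by the code below.
                out.write('\\');
                i++;
            }
            int cp = s.codePointAt(i);
            byte[] utf8 = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
            i += Character.charCount(cp);
        }
        // The collected bytes are interpreted as UTF-8 in one step.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String rdn = "uid=\\c3\\a7avu\\c5\\9fo\\c4\\9flu.\\c3\\bclk\\c3\\bctan";
        // The decoded RDN has 21 characters: "uid=" plus the 17-character uid.
        System.out.println(unescapeHexPairs(rdn).length());  // prints: 21
    }
}
```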

And to answer your last question, I can consider updating ldapsearch to support an alternate output format that tries to combine multiple values into a single CSV field. It would likely be considered a different output format (for example, something like “multi-valued-csv”) so that the current behavior would be preserved for the existing CSV format.

pbando commented 4 years ago

I was using the ldapsearch from the latest SDK against PingDirectory 7.xx, not the version bundled with PingDirectory.

dirmgr commented 4 years ago

I just repeated the above test with the 7.0.0.0 release of the Directory Server, using both the ldapsearch provided with the Directory Server and the one provided in the 5.1.0 release of the LDAP SDK, and don't see any issue with JSON. If you could please provide the LDIF representation of an entry and an exact ldapsearch command that demonstrates the problem, that would be very helpful.

dirmgr commented 4 years ago

FYI, I just committed an update to ldapsearch to add support for multi-valued-csv and multi-valued-tab-delimited output formats. If you use those output formats, any attribute with multiple values will have those values separated by the vertical bar (|) character. The existing csv and tab-delimited formats still only include the first value for each attribute.
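To make the difference between the two behaviors concrete, here is a minimal sketch in plain Java. It does not use or reproduce the SDK's code; the class name, method name, and boolean flag are hypothetical, and it does not address what should happen when a value itself contains a vertical bar.

```java
import java.util.Arrays;
import java.util.List;

final class MultiValuedFieldSketch {

    // Hypothetical helper: builds the text for one output column.
    // allValues = true  -> every value, separated by the vertical bar
    // allValues = false -> only the first value, as in the existing formats
    static String columnText(List<String> values, boolean allValues) {
        if (values.isEmpty()) {
            return "";
        }
        return allValues ? String.join("|", values) : values.get(0);
    }

    public static void main(String[] args) {
        List<String> objectClasses = Arrays.asList(
                "top", "person", "organizationalPerson", "inetOrgPerson");
        // prints: top|person|organizationalPerson|inetOrgPerson
        System.out.println(columnText(objectClasses, true));
        // prints: top
        System.out.println(columnText(objectClasses, false));
    }
}
```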

pbando commented 4 years ago

Great! Thank you for the info and the work.