can't handle binary data

huancz commented 7 years ago

testcase:

dn: cn=Barbara Jensen, ou=Product Development, dc=airius, dc=com
objectGUID;binary:: x7HAMN/oRNyn4w==

shift() result:

{ type: 'record',
  dn: 'cn=Barbara Jensen, ou=Product Development, dc=airius, dc=com',
  attributes:
   [ { attribute:
        { type: 'attribute',
          options: [ 'binary' ],
          attribute: 'objectGUID' },
       value: { type: 'value', value: 'Ǳ�0��Dܧ�' } } ] }

toObject doesn't make it any better:

{ dn: 'cn=Barbara Jensen, ou=Product Development, dc=airius, dc=com',
  attributes: { 'objectGUID;binary': [ 'Ǳ�0��Dܧ�' ] } }

The above ldif output format is real life data from running ldapsearch against Microsoft AD, I only changed the dn and objectGUID value to protect the innocent. The ;binary option is not added automatically by AD, more on that below.

Guessing which attributes are truly binary and which can be decoded to string is a hard problem (read - probably no better solution that enumerate binary exceptions, spec doesn't say anything about it). Another product (node-LDAP) solves this by letting application code to "tag" attributes with ";binary" option when searching. AD server doesn't care and returns the same data in both cases, but parser library can use such tag and return the attributes as Buffer. Either Buffer or original base64 string would be fine, but binary data decoded into javascript string is definitely not fine.

Unfortunately I'm not in the position to learn peg and send a PR.

Thanks

tapmodo commented 7 years ago

Hi! It appears it does detect and decode the binary data as expected. When I was writing the parser I wasn't sure what was the best way to handle this. My thinking was something as simple as a leading or trailing space can cause the value to need to be encoded, and I wanted to make using the data as simple as possible. However, I didn't have a lot of real world examples of how binary values are used (such as in AD). I can see why it is not desirable in this case.

In the first example output above, would it address the issue if, in addition to the value property, it also included encoded and buffer properties? Or perhaps value could be the Buffer, and include encoded and decoded values? Or just have the value be a Buffer and leave it at that?

Thanks for filing this issue! Sorry I didn't see it sooner. If you can let me know your thoughts on what you would like to see/get back in this case, that would be helpful in coming up with a solution.

huancz commented 7 years ago

It appears it does detect and decode the binary data as expected.

I'm not 100% sure if I understand this, just so we are on the same page here: I agree that base64 encoding is detected and decoded into Buffer correctly.

Ldap has many reasons why it would send value as base64. I don't know if the starting space can cause it like you mention, but the value being an UTF-8 string certainly will. Or UTF-16. Or being an image, or GUID or other binary data. Since it doesn't send mime-type or similar information, when the library detects base64, all it can SAFELY do is return a Buffer.

But your proposed solution seems fine. Interpreting the value as UTF-8 string is good default for maybe 90% of cases and it means simpler interface. If you give client application access to the decoded Buffer too, all it means it that the library is doing more work and using more memory than necessary. Neither should be a problem, ldiff data is usually small.

As to the method which to choose... Returning two representations of the same value doesn't work too well with toObject, the returned object would be much uglier. Returning only Buffer objects breaks backwards compatibility in another way.

Perhaps define the "not yet well-defined, leave true" option of toObject. decode=false would mean that some of the returned array members could be Buffer objects. Unless I'm forgetting something, it would mean that old applications will work the same as they did before, and those who need binary will have the option to set it to false and check Buffer.isBuffer before using the value.

tapmodo / node-ldif

can't handle binary data #2