utelle / wxpdfdoc

wxPdfDocument - Generation of PDF documents from wxWidgets applications
http://utelle.github.io/wxpdfdoc/
Other
70 stars 27 forks source link

don't use wxWidgets overload of unsigned char*() operator #65

Closed stkruse closed 4 years ago

vadz commented 4 years ago

ToStdWstring() didn't exist in 2.8, so you need to explicitly use std::wstring(s.wc_str(), s.length()) there.

OTOH I think you should be using utf8_str() in any case, and not UTF-16/32 anyhow (especially because the exact encoding of wchar_t is platform-dependent).

utelle commented 4 years ago

Your patch looks a bit weird. First, I fully agree with @vadz that method utf8_str should be used here, to get full generality and compatibility. Second, I think it is necessary to determine the length of the string produced by utf8_str and then use it for the call to the underlying Encrypt. Otherwise, some string data will be ignored.

stkruse commented 4 years ago

Oh, yes, i agree that length shouldn't be the length of the wxString... This happens if testing and publishing happens in different source files. It should be the length of the unsigned char array.

I still wasn't aware of utf8_str function, I'll rework my patch.

utelle commented 4 years ago

I fear the fix for this issue will not be as trivial as simply applying utf8_str. The reason is that the resulting string coming from the internal Encrypt method will almost never be an utf8 encoded string. That is, one will have to develop a method to convert the string to an 8-bit data buffer and vice versa in a lossless way.

vadz commented 4 years ago

There are always From8BitData() and To8BitData() methods. But I've never looked at this code, so I might be unaware of some problems... From a very high level point of view, it would make sense to encrypt UTF-8 string data as 8-bit data and decrypt 8-bit data as UTF-8.

utelle commented 4 years ago

Methods From8BitData() and To8BitData() use ISO8859-1 for their internal conversion. That is, not the full Unicode character set is covered. For characters not in ISO8859-1 the conversion from wxString to 8-bit data would fail.

As you described the route to go will be to convert to UTF-8, and then interpret the resulting string as 8-bit data and vice versa.

stkruse commented 4 years ago

Sorry for the late respond, but my access to the internet is currently limited and commenting from mobile devices is somehow exhausting. In my opinion the encrypt method shouldn't care about the encoding. It should just take the underlying data and call the encryption algorithm. Therefore I looked for a '.data()' function like the stl provides, but didn't find an equivalent. That's why I came up with ToStdWString() Of course, after the decryption you've to know the encoding of the original string to interpret the data in the correct way, but this shouldn't be part of the encryption. Maybe wxString::wx_str() is the correct way to get the data, but I've to check the implementation.

Edit:

As proposal:

  const wxStringCharType* wstr = str.wx_str();
  unsigned int len = str.Length() * sizeof(wxStringCharType);
  unsigned char* data = reinterpret_cast<unsigned char*> (const_cast<wxStringCharType*>(wstr));
  Encrypt(n, g, data, len);

I know, const_cast is ugly, but I got lost in wxString documentation and didn't find a better way to manipulate the raw data directly without making a copy.

utelle commented 4 years ago

In my opinion the encrypt method shouldn't care about the encoding. It should just take the underlying data and call the encryption algorithm.

In principle, you are right. And in fact the underlying method Encrypt taking unsigned char* data works in that manner. However, the method Encrypt taking wxString data could be called a convenience method encrypting a wxString delivering the result as wxString in place.

That is, it is necessary to transform the input wxString into unsigned char* and - after encryption - unsigned char* into wxString in a portable, lossless way. (Yes, the current implementation doesn't fulfill that requirement in general.)

As @vadz mentioned wchar_t is platform-dependent. Therefore using the internal data representation based on wchar_t will not solve the problem in a portable way.

Maybe resorting to wxUniChar for getting a representation for each individual Unicode character of the wxString could be used for the conversion process. This isn't a very elegant solution either, but certainly it would work.

stkruse commented 4 years ago

My point is that I don't want to perform any transformations. I just want to treat the array encapsulated in the wxString as unsigned char*. It doesn't matters if the type inside the wxString is a char*, awchar_t* and how wchar_t is implemented on a specific platform (it has at least the size of a char).

This should work on any platform. The only problem, I can imagine is when you encrypt and decrypt on different platforms but this is a general problem when you exchange wxStrings between platforms with different wchar_timplementations.

Of course, there are limitations on the resulting string, because you probably won't get encoded characters after the encryption. But, that's the nature of string encryption.

In fact, there is no guarantee that a wchar_t fits inside a wxUniChar (sure, on every system I know it would) and currently I'm not sure about the part of resorting the wxUniChars back to the wxString since a wxUniChar is a wxUInt32.

vadz commented 4 years ago

My point is that I don't want to perform any transformations. I just want to treat the array encapsulated in the wxString as unsigned char*.

This is a bad idea precisely because the internal representation differs across platforms. There is absolutely no reason for the generated PDF to be different depending on the platform where you generate it.

You absolutely should use utf8_str().

utelle commented 4 years ago

The following implementation works for Unicode strings as intended:

#define LOAD32_LE(p)            \
  ( ((uint32_t)((p)[0]) <<  0)  \
  | ((uint32_t)((p)[1]) <<  8)  \
  | ((uint32_t)((p)[2]) << 16)  \
  | ((uint32_t)((p)[3]) << 24)  \
  )

#define STORE32_LE(p, v)        \
  (p)[0] = ((v) >>  0) & 0xFF;  \
  (p)[1] = ((v) >>  8) & 0xFF;  \
  (p)[2] = ((v) >> 16) & 0xFF;  \
  (p)[3] = ((v) >> 24) & 0xFF;

void
wxPdfEncrypt::Encrypt(int n, int g, wxString& str)
{
  unsigned int len = (unsigned int) str.Length();
  unsigned char* data = new unsigned char[4*len];
  unsigned int j;
  wxUniChar uc;

  for (j = 0; j < len; j++)
  {
    uc = str.GetChar(j);
    STORE32_LE(data+(4*j), (unsigned  int) uc);
  }
  Encrypt(n, g, data, 4*len);
  for (j = 0; j < len; j++)
  {
    str.SetChar(j, LOAD32_LE(data+(4*j)));
  }
  delete [] data;
}

In principle, using for example the wxMBConvUTF32LE class, one could come up with a more elegant implementation, but unfortunately the used encryption algorithm easily produces 4-byte code points which are not legal Unicode code points. The above implementation works, because the validity of the code points is not checked, while wxMBConvUTF32LE does check for validity.

Of course, it would be best to find a solution which conforms with Unicode at all steps. However, I have not yet found a suitable algoritm.

utelle commented 4 years ago

After thoroughly analyzing the use of the problematic method Encrypt with wxString parameter, I came to the conclusion that I will most likely remove this method from the wxPdfEncrypt API alltogether in a future version. wxPdfDocument uses this method only in one single place internally (where the current implementation works, because only ASCII strings need to be decrypted).

If you need to implement a wxString encryption in your own code, you may still use the remaining Encrypt method. However, implementing 2 different functions will be the easiest approach - one for encrypting and one for decrypting. The former could use the UTF8 encoded string as input to the Encrypt method and use From8BitData on its output, the latter would do it vice versa.

Since I will definitely not accept the patch as currently presented, I close this PR for now.

Please feel free to reopen, if necessary.

stkruse commented 4 years ago

All right. I switched to the underlying encryption method working on unisgned char* anyway. I'd thought about a solution with two functions as well but since it wouldn't fit in the current wxpdfdoc api I didn't present such a solution here.

utelle commented 4 years ago

Right, for wxPdfDocument such functions are not needed.

However, for your own implementation I would strongly recommend to not use the approach presented in your PR, because it has the same drawback as my proposal with using UTF32: the encrypted string typically contains invalid Unicode code points. Currently wxString does not perform any validation, therefore the code "works". But relying on this behaviour means asking for trouble in the long run.

In former times passwords were restricted to ASCII characters - and for those the string encryption method in wxPdfEncrypt worked well. Nowadays with arbitrary Unicode passwords one has to implement string encryption differently.

stkruse commented 4 years ago

In fact, I don't need to convert the encrypted string to a wxString. I did the conversion only to present a possibility which should work like the old Encrypt function of wxpdfdoc. I can use the encrypted bytes and after decrypting it contains valid Unicode points again.

Thank you very much for your effort.