GoogleCodeExporter opened 8 years ago
Has anyone succeeded in fixing this issue? I tried the change suggested in
this post, but it didn't do the trick. My problem is that I need to deserialize a
file written in Lua containing Japanese characters into a .NET string, but
after calling LuaDLL::luaL_loadBuffer the Japanese characters are replaced with '?'.
I think this is due to the Marshal::StringToHGlobalAnsi calls and the fact that
char* is used everywhere (as opposed to wchar_t*).
Any ideas or suggestions?
Cheers
Original comment by simonis...@gmail.com
on 14 Jun 2011 at 10:53
You're right that using luaL_loadbuffer as-is would garble things. The
point of my original post was that you could load Lua files encoded in UTF-8,
e.g. using loadfile in a Lua script.
If you want a full round trip you will need to change luaL_loadstring and
luaL_loadbuffer as well (I think those were the only two really needed).
For example, luaL_loadbuffer would look somewhat like this:
static int luaL_loadbuffer(IntPtr luaState, String^ buff, String^ name)
{
    // Encode the managed strings as UTF-8 and pin the resulting byte
    // arrays so the unmanaged Lua API can read them in place.
    Encoding^ enc = Encoding::UTF8;
    array<Byte>^ bytesBuff = enc->GetBytes(buff);
    pin_ptr<Byte> pBuff = &bytesBuff[0];
    array<Byte>^ bytesName = enc->GetBytes(name);
    pin_ptr<Byte> pName = &bytesName[0];
    // toState resolves luaState to the underlying lua_State*
    // (LuaInterface convention). Passing bytesBuff->Length explicitly
    // means embedded NUL bytes in the UTF-8 data are preserved.
    return ::luaL_loadbuffer(toState, (char*)pBuff, bytesBuff->Length,
                             (char*)pName);
}
Oh, by the way, I found out there is a String constructor that already does all
that encoding and Marshal::Copy work. You would change
    return Marshal::PtrToStringAnsi [...]
to
    return gcnew String((sbyte*)str, 0, (int)strlen(str), Encoding::UTF8);
Original comment by a.wiedm...@raw-consult.com
on 16 Jun 2011 at 2:51
Thanks for answering.
Actually I worked on this yesterday and here is my version of luaL_loadbuffer
that fixes this issue:
static int luaL_loadbuffer(IntPtr luaState, String^ buff, String^ name)
{
    wchar_t *cs1 = (wchar_t *) Marshal::StringToHGlobalUni(buff).ToPointer();
    char *cs2 = (char *) Marshal::StringToHGlobalAnsi(name).ToPointer();
    // First call computes the required UTF-8 buffer size (it includes the
    // terminating NUL because the input length is -1).
    size_t sizeRequired = ::WideCharToMultiByte(CP_UTF8, 0, cs1, -1, NULL, 0, NULL, NULL);
    char *szTo = new char[sizeRequired + 1];
    szTo[sizeRequired] = '\0';
    ::WideCharToMultiByte(CP_UTF8, 0, cs1, -1, szTo, (int)sizeRequired, NULL, NULL);
    // CP: fix for MBCS, changed to use cs1's length (reported by qingrui.li)
    // Note: strlen stops at the first NUL byte; see the follow-up comment below.
    int result = ::luaL_loadbuffer(toState, szTo, strlen(szTo), cs2);
    delete[] szTo;   // was leaked in the original version
    Marshal::FreeHGlobal(IntPtr(cs1));
    Marshal::FreeHGlobal(IntPtr(cs2));
    return result;
}
cheers
Original comment by simonis...@gmail.com
on 16 Jun 2011 at 4:13
I just remembered something and thought of this thread.
In UTF-8, NUL is a valid byte, and .NET strings can contain embedded NULs.
Try this:
string s = "foo" + "\u0000" + "bar";
byte[] buff = Encoding.UTF8.GetBytes(s);
for (int i = 0; i < buff.Length; i++)
{
    Console.Write("{0:X2} ", buff[i]);
}
Console.WriteLine();
Output: 66 6F 6F 00 62 61 72
Although this might not happen very often, I don't think strlen is safe to use
here. I knew there was a reason I used bytesBuff->Length :)
There is no such thing as a NUL-terminated UTF-8 string.
Original comment by a.wiedm...@raw-consult.com
on 27 Jun 2011 at 1:27
Original issue reported on code.google.com by
a.wiedm...@raw-consult.com
on 28 Apr 2010 at 3:26