kokoye2007 / waitzar

Automatically exported from code.google.com/p/waitzar

Possibly allow dynamic code for Transformations? #105

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Just a thought here: it might be good if users could define a
transformation like so:

"uni2zg09" :
{
  "from-encoding" : "unicode",
  "to-encoding" : "zawgyi-2009",
  "type" : "script",
  "source" : "file.scr",
}

This eases the burden on us to develop transformations for the extreme
number of Burmese encodings out there.

I can think of two ways to do this:

1) Just make "file.scr" a series of Regular Expressions. Each one is
applied in order. Do something like Ko Soe Min does:
http://swtch.com/~rsc/regexp/regexp1.html
Advantages: Fast
Disadvantages: Somewhat difficult to develop (for users)

2) Embed the V8 javascript engine (the license matches). Then, make
"file.scr" a javascript file. 
Advantages: Powerful. Easy to develop (users can test their code in Chrome)
Disadvantages: We'll have to enforce security somehow.
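
A minimal sketch of option 1, assuming "file.scr" has already been parsed into an ordered list of pattern/replacement pairs. The RegexRule struct and the applyRules name are made up for illustration, and std::wregex is only a stand-in for whatever regex engine would actually be chosen:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical rule type: one pattern plus the text that replaces each match.
struct RegexRule {
    std::wregex pattern;
    std::wstring replacement;
};

// Apply every rule to the whole string, strictly in the order given.
// The output of one rule is the input of the next.
std::wstring applyRules(std::wstring text, const std::vector<RegexRule>& rules) {
    for (const RegexRule& rule : rules) {
        text = std::regex_replace(text, rule.pattern, rule.replacement);
    }
    return text;
}
```

Order matters here: with the rules (ab -> b) then (b+ -> X), the input "aabb" becomes "abb" and then "aX"; swapping the two rules gives "aaX" instead.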

Regarding the size of the V8 engine:

As a lib: 188  KB (super small)
As a DLL: 1.48 MB

With snapshots (for fast startup)
As a lib: 100  MB
As a DLL: 1.69 MB

We would have to put the library itself (no source) into the SVN
repository. There might be issues with compiling on a 32-bit platform and
then porting to a 64-bit one (or it might just run slower...).
Alternatively, we could build two versions and load one as a DLL at runtime. 

I'm not yet sure if the benefit of debugging in a browser outweighs the
possible issues with v8. Traditionally, WaitZar has leaned towards "slim"
solutions. But there are certainly benefits to using V8. For example, we
wouldn't have to link in spirit_json; we'd get a full JSON parser for free.
(The v8 parser would also probably be faster.) And v8 is fairly mature, so
recompiles wouldn't be required all the time, and we could offer the
latest DLLs on-site in case users wanted them. We might even be able to
embed the 188KB static library (for the single-executable solution) and
dynamically link in the snapshot-enhanced DLLs if they're available. We
could even offer both regexes and v8. In short, although regexes sound like
the WZ-style solutions we've used for years, a proper integration of v8
would actually be more in the spirit of WZ, would cut down our code base,
and would ultimately increase WZ's power. 

This issue won't be resolved until 1.9 (which, presumably, won't have too
many new features over 1.8 anyway), but I'm starting this bug report now
so we can think on it.

Original issue reported on code.google.com by seth.h...@gmail.com on 7 Mar 2010 at 1:08

GoogleCodeExporter commented 9 years ago
To build v8 with MinGW (which is needed in order to link it properly) you need 
to do the following:

  1) Open SConstruct and src/SConscript, and replace "Environment()" with "(Environment(tools = ['mingw']))"

  2) src/v8utils.h needs the following:
   #ifdef __GNUC__
     #include <stdarg.h>
   #endif

  3) platform-win32.cc requires the long form of strncpy_s:
   strncpy_s(name_, sizeof(name_), name, sizeof(name_));
   ...instead of:
   strncpy_s(name_, name, sizeof(name_));

Original comment by seth.h...@gmail.com on 16 Jan 2011 at 4:04

GoogleCodeExporter commented 9 years ago
Also need to add libraries:
   ws2_32
   wsock32
   winmm
...to the linker. 

-------------------------------

Might also want to change "-O3" to "-Os" in the SConstruct file. (Or not, see 
below)

-O3 isn't worth it on GCC 4.5. It might be worth it on 4.6. Since this is a 
library, we might consider -O3. 

Original comment by seth.h...@gmail.com on 16 Jan 2011 at 4:29

GoogleCodeExporter commented 9 years ago
The v8 library needs (seems to need) _WIN32_WINNT set to 0x0501. This would 
bump the minimum Windows version for WaitZar up from Windows 2000 to Windows 
2003 or XP. 

Original comment by seth.h...@gmail.com on 16 Jan 2011 at 4:38

GoogleCodeExporter commented 9 years ago
The linker might also need "-static" set.... not sure about this one, but 
trying it out.

Original comment by seth.h...@gmail.com on 16 Jan 2011 at 4:46

GoogleCodeExporter commented 9 years ago
NOTE: We can keep the Windows version at 0x0500 (Win2k). 

Original comment by seth.h...@gmail.com on 16 Jan 2011 at 5:02

GoogleCodeExporter commented 9 years ago
This makes the EXE 5MB. Might have to leave this functionality in the DLL.... 
but I'd really like it for reading JSON without the need for Boost.

Original comment by seth.h...@gmail.com on 16 Jan 2011 at 5:05

GoogleCodeExporter commented 9 years ago
Some fun notes:
  1) A mundane (recursive descent?) library for parsing JSON, json-cpp, is nice and tiny (5~8 files), has some notion of "comments" already, and seems pretty fast. 
  http://jsoncpp.sourceforge.net/

So, that removes the Boost dependency. Now, we can implement any scripting 
language we want. Maybe Lua? 

Alternatively, we can implement a language (like Lua) and then load a Lua 
library that parses JSON. 

Original comment by seth.h...@gmail.com on 17 Jan 2011 at 9:05

GoogleCodeExporter commented 9 years ago
Note: Lua is very tiny (~800 kb source), and, though written in C, compiles as 
C++ as well. It is also extremely stable. 

Now reading the syntax of Lua....

Original comment by seth.h...@gmail.com on 17 Jan 2011 at 9:25

GoogleCodeExporter commented 9 years ago
Building Lua in release mode takes 15s. It creates a DLL 248 kb in size. 
Building as a static library creates a .lib which is about the same size. So, 
we'll most likely want to just build a DLL and leave Lua as an extension. 

Original comment by seth.h...@gmail.com on 18 Jan 2011 at 5:34

GoogleCodeExporter commented 9 years ago
We should give the Lua DLL a checksum, so that we can (partially) avoid people 
substituting an older/newer version of Lua. 

I doubt people will try to hack WZ that way, so we can just store these 
checksums in the config files, and have a flag "verify-md5-checksum" that's 
On/Off.

Original comment by seth.h...@gmail.com on 18 Jan 2011 at 5:39

GoogleCodeExporter commented 9 years ago
Lua requires some hacking to get Unicode working:
  http://lua-users.org/wiki/UnicodeIdentifers
  http://lua-users.org/wiki/ValidateUnicodeString
  http://luaforge.net/projects/sln/

Seems like it's not too hard. (Have to check if slnunicode supports pattern 
matching).

Original comment by seth.h...@gmail.com on 18 Jan 2011 at 6:54

GoogleCodeExporter commented 9 years ago
Note that Lua doesn't have full regexes; rather, it has simpler character-level 
patterns (no alternation, and quantifiers apply only to single character 
classes). This is bad for, e.g., kinzi and stacked letters.

One option I overlooked was using json-cpp to parse the JSON, then compiling V8 
as a DLL (using VC++) and loading it dynamically. We could write some kind of 
scaffolding function that wraps the entire call in try{}catch{} (compiled with 
VC++ as well) and returns an error code if something goes wrong, since MinGW 
won't be able to catch an exception thrown from a DLL.
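
A rough sketch of that scaffolding idea, assuming a plain C entry point exported from the DLL. The name wz_convert, the error codes, and the runScript stand-in (which takes the place of the real V8 call) are all hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical error codes returned across the DLL boundary.
enum WzResult { WZ_OK = 0, WZ_ERR_ARGS = 1, WZ_ERR_BUFFER = 2, WZ_ERR_EXCEPTION = 3 };

// Stand-in for the real script call; it throws to simulate a failure.
static std::wstring runScript(const std::wstring& input) {
    if (input.empty()) throw std::runtime_error("nothing to convert");
    return input;  // a real implementation would return the converted text
}

// C linkage keeps the exported name unmangled. No exception ever leaves
// this function: every failure is turned into an error code.
extern "C" int wz_convert(const wchar_t* src, wchar_t* dst, std::size_t dstLen) {
    try {
        if (src == nullptr || dst == nullptr) return WZ_ERR_ARGS;
        const std::wstring result = runScript(src);
        if (result.size() + 1 > dstLen) return WZ_ERR_BUFFER;  // room for the terminator
        std::copy(result.begin(), result.end(), dst);
        dst[result.size()] = L'\0';
        return WZ_OK;
    } catch (...) {
        return WZ_ERR_EXCEPTION;
    }
}
```

The caller only ever sees an int and a caller-owned buffer, so nothing compiler-specific (exceptions, std::wstring layout) has to cross between the VC++ and MinGW sides.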

This might be the best option; I'll write a simple test script for this later. 

Original comment by seth.h...@gmail.com on 19 Jan 2011 at 5:56

GoogleCodeExporter commented 9 years ago
I wrote a test script for json-cpp. Basically, we have to treat the string as 
UTF-8, and only convert it to wchar_t* when we actually need the value. 

Sample (and library) code attached.

Note that we could define JSON_VALUE_USE_INTERNAL_MAP if we want objects to be 
stored as maps instead of vectors. Currently, config files are small, so I see 
no need to do this. 

(This point is irrelevant for now; JSON_VALUE_USE_INTERNAL_MAP will trigger a 
union error in MinGW).
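
For reference, the "keep it as UTF-8 until needed" step might look roughly like this. It is a hand-rolled decoder written purely as an illustration, not the attached sample code; utf8ToWide is a hypothetical name, and it does not reject overlong encodings:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into a wstring. When wchar_t is 16 bits wide,
// code points above U+FFFF are written as surrogate pairs.
std::wstring utf8ToWide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("bad UTF-8 lead byte");
        if (i + extra >= in.size()) throw std::invalid_argument("truncated UTF-8");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

Myanmar letters sit in the three-byte range, so a value such as the bytes E1 80 80 decodes to the single character U+1000.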

Original comment by seth.h...@gmail.com on 19 Jan 2011 at 9:09

GoogleCodeExporter commented 9 years ago
Replaced Json_Spirit with Json_CPP. Removed Boost as well. Total build time is 
down to 149 seconds.

This also allows us to focus on scripting (Javascript or Lua) using a DLL only.

Original comment by seth.h...@gmail.com on 19 Jan 2011 at 10:38

GoogleCodeExporter commented 9 years ago
v8 accepts either UTF-8 or unsigned short* arrays. Both require some conversion 
from wstrings, so the choice of which format to use will depend on how V8 (or 
Javascript in general) handles Unicode internally. 

Original comment by seth.h...@gmail.com on 20 Jan 2011 at 5:46

GoogleCodeExporter commented 9 years ago
From the mailing list:

> The internal storage format for strings are either ASCII (one byte per char)
> or UTF-16 (two bytes per string). So any UTF-8 string which has only ASCII
> characters is stored as ASCII otherwise UTF-8 is converted to UTF-16. The
> fact that there is no uint16_t version of NewSymbol in the API is mainly
> because no-one has added it.

So, we'll use the UTF-16 variant. 

Original comment by seth.h...@gmail.com on 20 Jan 2011 at 5:54

GoogleCodeExporter commented 9 years ago
Note: on Windows, wchar_t and uint16_t are the same size. So we should be able 
to fast-convert the array.

Maybe with a pointer conversion?
uint16_t* arr = &wstring().c_str()[0];
Is this a bad idea? It's certainly fast.

Original comment by seth.h...@gmail.com on 20 Jan 2011 at 6:01

GoogleCodeExporter commented 9 years ago
It's even easier.
For input into v8:
uint16_t* x = (uint16_t*)(src.c_str());

For output from v8:
wstring myresStr((wchar_t*)*myres);

Note that, somewhere in the code, we should ensure that:
sizeof(wchar_t) == sizeof(uint16_t)
...just to give anyone compiling WZ on a 64-bit system some warning. 
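
A more defensive sketch of the input conversion, assuming we'd rather copy than cast. toUtf16 is a hypothetical helper; it copies code units directly when wchar_t is 16 bits wide and falls back to building surrogate pairs when it is wider (as on most non-Windows platforms), so the size check stops being a hard requirement:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Copy a wstring into a buffer of UTF-16 code units.
std::vector<std::uint16_t> toUtf16(const std::wstring& src) {
    std::vector<std::uint16_t> out;
    out.reserve(src.size());
    for (wchar_t wc : src) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        assert(cp <= 0x10FFFF);  // anything larger is not a Unicode code point
        if (sizeof(wchar_t) == sizeof(std::uint16_t) || cp < 0x10000) {
            // Already a UTF-16 code unit (or a BMP code point): copy as-is.
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            // Wide wchar_t holding a supplementary code point: split it.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The result's data() pointer and size() can then be handed to whatever call wants a 16-bit array plus a length. The cost is one extra copy per string, which is probably negligible for the short strings WZ deals with.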

Original comment by seth.h...@gmail.com on 20 Jan 2011 at 6:19

GoogleCodeExporter commented 9 years ago
Interestingly enough, most of the security concerns I was worried about aren't 
in ECMAScript at all, but in the various browser addons. 

For example, the following objects are not defined:
  * document
  * fopen
  * xmlHttpRequest
  * alert

This might turn out to be a decent option after all.

Original comment by seth.h...@gmail.com on 20 Jan 2011 at 6:58

GoogleCodeExporter commented 9 years ago
Wrote a small driver program and compiled it into V8. The symbol is definitely 
there (checked with objdump).

Next up:
  1) Visual Studio: Use LoadLibrary() etc. (no *.lib file) to load the DLL.
  2) MinGW: Repeat
  3) MinGW: Again, with UPX'd dll
  4) Port into WaitZar proper, with all the config file fun that entails.

Original comment by seth.h...@gmail.com on 21 Jan 2011 at 7:11

GoogleCodeExporter commented 9 years ago
1) works, after a dash of extern "C"

Original comment by seth.h...@gmail.com on 22 Jan 2011 at 5:41

GoogleCodeExporter commented 9 years ago
2) Done. 

Original comment by seth.h...@gmail.com on 22 Jan 2011 at 7:57

GoogleCodeExporter commented 9 years ago
3) Done

Time to think about config fun-ness. 

Original comment by seth.h...@gmail.com on 22 Jan 2011 at 8:10

GoogleCodeExporter commented 9 years ago
Note that loading the DLL into memory takes about 3.5MB. Not a big deal (since 
it's compact & only 700kb on disk), but this reinforces the idea that we'll need 
the ability to disable DLLs. 

For configs, I'm thinking of something like this:
"languages.myanmar.transformations" : 
{
  "uni2ayar" : 
  {
    "from-encoding" : "unicode",
    "to-encoding" : "ayar",
    "type" : "javascript",
    "source-file" : "uni2ayar.js",
  }
}

That handles the transformation. (Note that uni2ayar.js is located in the 
current directory, as with most config settings). If "javascript" is disabled 
(or the DLL is missing) then this simply discards the transformation.
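
The "simply discards the transformation" rule could be sketched like this, assuming the config has already been read into plain structs. Transformation, usableTransformations, and the "builtin" type name used in the comments are illustrative only:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory form of one transformation entry from the config.
struct Transformation {
    std::string id;            // e.g. "uni2ayar"
    std::string fromEncoding;  // e.g. "unicode"
    std::string toEncoding;    // e.g. "ayar"
    std::string type;          // e.g. "javascript", or a made-up "builtin"
    std::string sourceFile;    // e.g. "uni2ayar.js"
};

// Keep only the transformations whose "type" is currently available,
// i.e. its extension is enabled and its DLL was found. The rest are dropped.
std::vector<Transformation> usableTransformations(
        const std::vector<Transformation>& all,
        const std::set<std::string>& availableTypes) {
    std::vector<Transformation> kept;
    for (const Transformation& t : all) {
        if (availableTypes.count(t.type) > 0) {
            kept.push_back(t);
        }
    }
    return kept;
}
```

With only "builtin" available, a "javascript" entry such as uni2ayar would be dropped while the others survive; adding "javascript" to the set brings it back.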

Now, for the DLL loading, we should probably have a directory called 
"config/Common" which contains them. All DLLs must load using a path relative 
to Common (no sub-directories either). DLLs may have an MD5 checksum, and may 
be disabled. We'll need a new top-level directive (like "languages" and 
"settings"), and these should be resolved _first_ in resolvePartialSettings(). 
Something like this:

"extensions" : 
{
  "javascript" : 
  {
    "library-file" : "v8_wz.dll",
    "enabled" : "yes",
    "md5-hash" : "6CC7C73271E31F5D4AD48BCACD27A4EB",
    "check-md5" : "yes"
  },
  #More
}
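
A sketch of how these fields might gate loading, assuming the MD5 of the file on disk is computed elsewhere and passed in as a hex string. ExtensionConfig, sameDigest, and shouldLoad are hypothetical names:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical in-memory form of one entry under "extensions".
struct ExtensionConfig {
    std::string libraryFile;  // "library-file"
    bool enabled;             // "enabled"
    std::string md5Hash;      // "md5-hash"
    bool checkMd5;            // "check-md5"
};

// Compare two hex digests, ignoring upper/lower case.
bool sameDigest(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return false;
    }
    return true;
}

// Decide whether the DLL should be loaded at all.
bool shouldLoad(const ExtensionConfig& ext, const std::string& actualMd5Hex) {
    if (!ext.enabled) return false;   // disabled extensions are never loaded
    if (!ext.checkMd5) return true;   // verification switched off
    return sameDigest(ext.md5Hash, actualMd5Hex);
}
```

Checking "enabled" before anything else means a disabled DLL is never even hashed.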

Note that, since loading and then unloading a DLL still leaves a good deal of 
memory in use (1MB), it makes sense to simply load the DLL (unless disabled), 
test for the conversion function, then leave it open.

After we implement this option, the first thing to do will be to check for 
memory leaks. 

The second thing to do will be to implement a "fallback" function. For example, 
disabling javascript might disable Ayar, but it shouldn't disable, say, 
Burglish. 

Original comment by seth.h...@gmail.com on 24 Jan 2011 at 4:43

GoogleCodeExporter commented 9 years ago
Done. There are some minor problems reporting errors (and fallbacks) but I think 
that will require re-writing the config parser a bit. 

Original comment by seth.h...@gmail.com on 25 Jan 2011 at 7:46