phorward / unicc

LALR parser generator targeting C, C++, Python, JavaScript, JSON and XML
MIT License
56 stars, 9 forks

binary literal support #13

Closed mgood7123 closed 3 years ago

mgood7123 commented 5 years ago

Would it be possible to add a template for parsing raw binary input/output, such as when developing disassemblers or assemblers? (Raw binary has no EOF unless a specific binary sequence is chosen to represent it.)

Note: one solution may be to convert the entire file/input into a binary literal string, so that only extremely minimal modification to the C template is needed, though I have not tested this.

phorward commented 5 years ago

Hi there, and thanks for your request.

The C and C++ targets are very flexible regarding input processing. There is a define called UNICC_GETINPUT, which can be set to any function that delivers characters of any kind. EOF can be handled (dynamically) by setting the eof member of the parser control block (pcb). Please see section 5 of the User's Manual for further information.
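A minimal sketch of what such an input function could look like for raw binary, assuming the input sits in a memory buffer of known length. The names `bin_source`, `bin_source_init`, `bin_source_next` and `BIN_EOF` are hypothetical and not part of UniCC; only UNICC_GETINPUT and the pcb's eof member come from the description above. Whether the eof member can hold a value outside 0..255 depends on the character type the template uses, so treat that part as an assumption to check against the manual.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical end-of-input marker. It lies outside 0..255, so every byte
   value, including 0x00, stays usable as ordinary input. */
#define BIN_EOF 256

/* Hypothetical in-memory byte source (not part of UniCC). */
typedef struct {
    const unsigned char *data;  /* raw binary input */
    size_t len;                 /* number of bytes in data */
    size_t pos;                 /* index of the next unread byte */
} bin_source;

static void bin_source_init(bin_source *src, const unsigned char *data, size_t len)
{
    assert(src != NULL);
    assert(data != NULL || len == 0);
    src->data = data;
    src->len = len;
    src->pos = 0;
}

/* Return the next byte as 0..255, or BIN_EOF once the buffer is exhausted. */
static int bin_source_next(bin_source *src)
{
    assert(src != NULL);
    if (src->pos >= src->len)
        return BIN_EOF;
    return src->data[src->pos++];
}

/* Assumed wiring in the grammar's prologue code; the exact form the
   template expects is described in the manual:
       #define UNICC_GETINPUT  bin_source_next(&my_source)
   together with setting the pcb's eof member to BIN_EOF. */
```

Because the end of input is tracked by length instead of by a sentinel byte, a 0x00 in the middle of the data is delivered like any other byte.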

Currently, UniCC only allows for character-based input processing, and the scanner is called by the parser. A parser that is called from the scanner (so-called "push parsing") is currently not implemented.

Could this help you, or can you provide a practical example of the case you want to implement?

phorward commented 5 years ago

This is what a push parser might look like:

foreach( token in tokens )
{
    if( push_parse( token ) != PARSER_STATE_NEXT )
        break;
}

if( push_parse( EOF ) == PPPAR_STATE_DONE )
    printf( "Success!\n" );

Is this what you're looking for?
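To make the calling convention concrete, here is a small hand-written sketch in C. It is not UniCC code and not generated by it; `push_parser`, `push_init`, `push_parse`, `TOKEN_END` and the `PUSH_*` states are hypothetical names. The toy "grammar" accepts one or more whole bytes delivered bit by bit (tokens 0 and 1). The point is only that the caller drives the loop and the parser keeps its state between calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result states, mirroring the pseudocode above. */
typedef enum { PUSH_NEXT, PUSH_DONE, PUSH_ERROR } push_state;

/* Hypothetical end-of-input token; real tokens here are the bits 0 and 1. */
#define TOKEN_END (-1)

typedef struct {
    unsigned bit_count;   /* bits collected for the current byte (0..7) */
    size_t   byte_count;  /* complete bytes recognized so far */
    int      failed;      /* sticky error flag */
} push_parser;

static void push_init(push_parser *p)
{
    assert(p != NULL);
    p->bit_count = 0;
    p->byte_count = 0;
    p->failed = 0;
}

/* Feed one token. The parser never reads input itself. */
static push_state push_parse(push_parser *p, int token)
{
    assert(p != NULL);
    if (p->failed)
        return PUSH_ERROR;

    if (token == TOKEN_END) {
        /* Accept only if at least one byte was seen and none is half done. */
        if (p->bit_count == 0 && p->byte_count > 0)
            return PUSH_DONE;
        p->failed = 1;
        return PUSH_ERROR;
    }

    if (token != 0 && token != 1) {
        p->failed = 1;
        return PUSH_ERROR;
    }

    if (++p->bit_count == 8) {
        p->bit_count = 0;
        p->byte_count++;
    }
    return PUSH_NEXT;
}
```

A caller would loop over its bits, stop as soon as `push_parse` returns something other than `PUSH_NEXT`, and finally push `TOKEN_END` to ask whether the input seen so far is complete.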

mgood7123 commented 5 years ago

Like...

// small example of detecting alphabetical characters in binary
//eof shall be string terminator, 00000000 in binary or '\0'
ascii_prefix = 011000 // since we only use 3 alphabetical characters we can use a prefix that leaves only 2 bits available
a = ascii_prefix 01 /* 01100001 */ { puts("received letter a"); }
b = ascii_prefix 10 /* 01100010 */ { puts("received letter b"); }
c = ascii_prefix 11 /* 01100011 */ { puts("received letter c"); }
// ...
$alpha = a b c //...

phorward commented 5 years ago

Thanks for your reply. I still think your problem can be solved with either the UNICC_GETINPUT function, or with a push-parsing solution. A more concrete use case for your problem might help me to better understand how your problem can be solved best.

mgood7123 commented 5 years ago

The primary use would be disassembly. A very simple example:

010011100101011 DO_FUNCTION 0100100001111001 DO_FUNCTION

And so on

mgood7123 commented 5 years ago

for example

prim : 0100111001010 DEC

DEC : 000 FUNCA | 001 FUNCB | ... | 111 FUNCI
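Read as a bit pattern, the rule above describes a 16-bit word: a fixed 13-bit prefix followed by a 3-bit selector. A hand-written sketch of that decoding step in C, assuming the word is held in a `uint16_t` with the prefix in the high bits. The names `decode_prim` and `dec_name` are hypothetical, and the alternatives elided with "..." in the rule stay elided here.

```c
#include <assert.h>
#include <stdint.h>

#define PRIM_PREFIX 0x09CAu                    /* 0100111001010, 13 bits */
#define DEC_BITS    3                          /* width of the DEC field */
#define DEC_MASK    ((1u << DEC_BITS) - 1u)    /* 0x7 */

/* Return the DEC selector (0..7) if the word starts with the prim prefix,
   or -1 if it does not. */
static int decode_prim(uint16_t word)
{
    if ((word >> DEC_BITS) != PRIM_PREFIX)
        return -1;
    return (int)(word & DEC_MASK);
}

/* Map a selector to the name used in the thread. Only the alternatives
   spelled out there are named; the elided ones are reported as "...". */
static const char *dec_name(int selector)
{
    assert(selector >= 0 && selector <= 7);
    switch (selector) {
    case 0:  return "FUNCA";   /* 000 */
    case 1:  return "FUNCB";   /* 001 */
    case 7:  return "FUNCI";   /* 111 */
    default: return "...";     /* elided in the thread */
    }
}
```

With this layout the word 0x4E50 (0100111001010 000) decodes to selector 0, and 0x4E57 (0100111001010 111) to selector 7.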

phorward commented 5 years ago

OK, now I understand your problem.

This might take some time to implement: both support for external tokens and a push-parsing approach are necessary to provide this feature. Can you be patient while it is implemented, and would you be interested in testing it when it is ready?

mgood7123 commented 5 years ago

Yes

mgood7123 commented 5 years ago

how is the implementation coming along?

phorward commented 5 years ago

> how is the implementation coming along?

Hi, I'm still working on UniCCv2, but if you need it quite soon I can push it into 1.6. Would it be enough to associate tokens with individual external IDs (integer IDs)? E.g., so that DEC becomes 1?

mgood7123 commented 5 years ago

> > how is the implementation coming along?
>
> Hi, I'm still working on UniCCv2, but if you need it quite soon I can push it into 1.6. Would it be enough to associate tokens with individual external IDs (integer IDs)? E.g., so that DEC becomes 1?

I don't know enough about UniCC internals to say whether it would be "enough to associate tokens with individual external IDs", but as long as it works I don't really care how it is implemented. It just needs to be able to parse binary either as strings (e.g. "10010") or as raw binary (e.g. 10010).

The only requirement is that it accepts raw binary and either parses it as is or converts it to a string and then parses that.

Though obviously, in the raw > string case it will need to convert as the input arrives, e.g. 10010 > "1", "0", "0", "1", "0". Otherwise it may just hang: it would try to read up to EOF (which never comes) before converting the entire input to a string, and so would never finish.
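The convert-on-input idea can be sketched as a small reader that hands out one '0' or '1' character per call, so nothing ever waits for the whole input. This is a hand-written illustration, not UniCC code; `bit_reader`, `bit_reader_init` and `bit_reader_next` are hypothetical names, and the in-memory buffer only stands in for whatever stream the bytes really come from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical streaming bit reader (not part of UniCC). */
typedef struct {
    const unsigned char *data;  /* raw binary input */
    size_t   len;               /* number of bytes in data */
    size_t   byte_pos;          /* index of the current byte */
    unsigned bit_pos;           /* next bit in that byte, 0 = most significant */
} bit_reader;

static void bit_reader_init(bit_reader *r, const unsigned char *data, size_t len)
{
    assert(r != NULL);
    assert(data != NULL || len == 0);
    r->data = data;
    r->len = len;
    r->byte_pos = 0;
    r->bit_pos = 0;
}

/* Return the next bit as the character '0' or '1', most significant bit
   first, or -1 when no bits remain. Each call looks at one byte only. */
static int bit_reader_next(bit_reader *r)
{
    unsigned bit;

    assert(r != NULL);
    if (r->byte_pos >= r->len)
        return -1;

    bit = (r->data[r->byte_pos] >> (7u - r->bit_pos)) & 1u;
    if (++r->bit_pos == 8) {
        r->bit_pos = 0;
        r->byte_pos++;
    }
    return bit ? '1' : '0';
}
```

For the single byte 0x90 (10010000), the first five calls yield '1', '0', '0', '1', '0', matching the 10010 example above.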

Then again, an EOF might just be interpreted as a specific binary sequence, such as a HALT instruction. That raises the possibility of having to store, for example, a 5 GB or larger binary string if HALT exists far, far, far beyond the executable code. It would also be extremely specific to the code being parsed: such an instruction normally exists only in something intended to halt all binary code execution, such as when powering off the machine or similar, depending on the use case.

phorward commented 3 years ago

Will close this now. UniCC will be abandoned.