Preprocessing of Strings, Little more Statistics in Python Decoder

joha2 commented 8 years ago

The function preprocessString in the Parse class is now able to transform a string

@ has-square-divisor-within | ? top | ? x | if (< $top 0) 0 | if (= $x | * $top $top) 1 | has-square-divisor-within (- $top 1) $x

which is in human-readable form, into the string

@ has-square-divisor-within (? top (? x (if (< (top) 0) 0 (if (= (x) (* (top) (top))) 1 (has-square-divisor-within (- (top) 1) (x))))))

which is easier to parse. Another advantage is that the encoding of this string in the message afterwards is more coherent for the intended recipient, see #15. As a result, the subsequent stringToList and evaluateInContext functions could be simplified drastically. :-) The commits are currently in an intermediate debug stage: there are many trace commands in the source, and the message breaks shortly after the beginning.
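For readers who want to experiment with the transformation described above, here is a minimal Python sketch. It is not the Haxe preprocessString from this branch. The function name preprocess_string and the two rules it applies are assumptions read off the example pair: `|` opens a group that closes at the end of the enclosing group, and `$name` becomes `(name)`.

```python
def preprocess_string(source: str) -> str:
    """Expand the `|` and `$name` shorthands into explicit parentheses.

    Assumed rules (inferred from the example in this thread):
      * `|` opens a group that is closed at the end of the enclosing group
      * `$name` is shorthand for `(name)`
    """
    # Pad parentheses so that a plain split() yields one token per symbol.
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()
    out = []
    # One counter per open group: how many `|` groups still need closing.
    pending = [0]
    for tok in tokens:
        if tok == "|":
            out.append("(")
            pending[-1] += 1
        elif tok == "(":
            out.append("(")
            pending.append(0)
        elif tok == ")":
            # Close every `|` group opened inside this group, then the group.
            out.extend(")" * pending.pop())
            out.append(")")
        elif tok.startswith("$") and len(tok) > 1:
            out.append("(" + tok[1:] + ")")
        else:
            out.append(tok)
    # Close the `|` groups that are still open at the end of the line.
    out.extend(")" * pending.pop())
    # Remove the padding around parentheses again.
    return " ".join(out).replace("( ", "(").replace(" )", ")")


line = ("@ has-square-divisor-within | ? top | ? x | if (< $top 0) 0 "
        "| if (= $x | * $top $top) 1 "
        "| has-square-divisor-within (- $top 1) $x")
print(preprocess_string(line))
```

The sketch does not validate its input: an unbalanced `)` raises an IndexError, and other shorthands such as `/` are passed through unchanged.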

alanfwilliams commented 8 years ago

So, I'm not comfortable merging something that breaks the message. Do you know why the message is breaking? Is it because something isn't implemented yet, or is it a bug?

joha2 commented 8 years ago

Actually it is not a problem if this branch is not merged, because it is buggy. So after a few commits maybe the message will work again ;-). As already discussed in #14, the message was also broken before, only in another sense ;-) Before I debugged the code, the encoding was not right due to the texty symbols, and now the evaluation does not work the right way, because the cosmicos.js script fails at some point while 'compiling' the message. (One of the unarys returns false instead of true.)

alanfwilliams commented 8 years ago

Right. So right now you are working on figuring out why the evaluation fails?

joha2 commented 8 years ago

Oh sorry. I wondered where to put that and how to prevent submission. By using a .gitignore file? Yes, I am trying to figure out where the evaluation fails. But I'm afraid that I cannot do much until I know the logic behind the code better. As already mentioned, I wrote a preprocessing step, but where should the split between evaluation and encoding be, and what is the best intermediate format for the line to be evaluated?

alanfwilliams commented 8 years ago

It is ok. I can revert it before merging. In terms of the best format, what format is it in now when it is processed?

alanfwilliams commented 8 years ago

Also, you can remove the commit yourself with a "rebase": https://help.github.com/articles/about-git-rebase/

joha2 commented 8 years ago

The format now depends on the stage of the evaluation procedure. At the end of the day it gets converted into a function. In the intermediate steps it is a nested list (a list of lists of ... lists): first of strings, then of different formats, and finally of functions.
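To make the first of these intermediate formats concrete, here is a small Python sketch that turns a fully parenthesized line into a nested list of strings. The name string_to_list only echoes the stringToList function mentioned earlier in this thread. It is a hypothetical illustration, not the Haxe implementation, and it does not cover the later stages (mixed formats, functions).

```python
def string_to_list(text: str) -> list:
    """Parse a fully parenthesized line into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[-1] is the list currently being filled
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


print(string_to_list("if (< (top) 0) 0 1"))
```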

For the commit I could also remove the line from the README.md file. Would this be helpful?

alanfwilliams commented 8 years ago

Yeah, that would be good. Maybe a good debugging tool, although it may increase build time, is to do a evaluation at each stage.

joha2 commented 8 years ago

What do you mean by "do a evaluation at each stage"? Which evaluation at which stage? :-)

alanfwilliams commented 8 years ago

(Note: I am replying by email, I hope this works.) What I mean is that, AFAIK, it starts out in a (blah); format, gets parsed, and then each part is turned into the 0,1,2 format. So my idea is that at each stage of the process a "sanity check" should occur against the known answer value. If you want me to, I could find an example in the source and go through it.

On Jul 20, 2016, at 6:12 PM, joha2 notifications@github.com wrote:

What do you mean by "do a evaluation at each stage"? Which evaluation at which stage? :-)


joha2 commented 8 years ago

Yes it works! :-) Ah I understood. Something like unit tests. I think this is a good idea, but as long as the final format is not settled, these tests have to be rewritten every time the format or the intermediate stages change.

alanfwilliams commented 8 years ago

That is true. How about, for now, just having it print the message at each stage to STDOUT? If someone wants to check it, they can do so by hand. This may not work for the longer bits of the message, but the errors we are encountering right now are in the beginning, where things are still small.
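A rough Python sketch of that idea, in case it helps the discussion: each stage is a plain function, the result of every stage is printed to STDOUT, and a known value can optionally be supplied for a stage as a sanity check. The helper run_with_trace and the stage names are hypothetical. The real build is written in Haxe and its stages are not modelled here.

```python
def run_with_trace(line, stages, expected=None):
    """Run `line` through `stages`, printing the result of each stage.

    stages   -- list of (name, function) pairs, applied in order
    expected -- optional dict mapping a stage name to its known result
    """
    expected = expected or {}
    value = line
    for name, stage in stages:
        value = stage(value)
        print(f"[{name}] {value!r}")
        if name in expected and value != expected[name]:
            raise AssertionError(
                f"stage {name!r}: got {value!r}, expected {expected[name]!r}"
            )
    return value


# Toy stages standing in for parse / encode steps.
toy_stages = [("tokenize", str.split), ("count", len)]
run_with_trace("= 2 (+ 1 1)", toy_stages, expected={"count": 5})
```

Leaving `expected` empty gives the print-only behaviour. Filling it in for a few short lines at the start of the message moves towards the unit-test style check without fixing the format of every stage.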

joha2 commented 8 years ago

I am not sure how to do this. Actually, the compiling part of the message is very difficult to understand. At the moment there are many trace instructions in the hx code (inserted by me) and you can codify some certain sequence of instructions more or less manually by using a debugging script (see wiki). But this is far away from a unit test or an automatic output of the encoded message in intermediate stages. Perhaps at least for testing purposes it is interesting for you to clone my fork. :-)

paulfitz commented 8 years ago

I'm a bit nervous about this transformation being done at the string level. If I understand right, it requires duplicating a lot of parsing code. How about making the transformation one step later, once the string has been parsed? It should be a more straightforward process in that case.
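As a rough illustration of what the tree-level variant could look like, here is a Python sketch that assumes the parser has already produced nested lists in which `|` is still an ordinary token. The name fold_pipes is hypothetical, and the `$` shorthand is not handled here.

```python
def fold_pipes(node):
    """Replace each `|` token by a nested list holding the rest of its group."""
    if not isinstance(node, list):
        return node
    if "|" in node:
        i = node.index("|")
        head = [fold_pipes(item) for item in node[:i]]
        # Everything after the first `|` becomes one nested group, which may
        # itself contain further `|` tokens.
        return head + [fold_pipes(node[i + 1:])]
    return [fold_pipes(item) for item in node]


parsed = ["?", "x", "|", "if", ["=", "a", "|", "*", "b", "b"], "1"]
print(fold_pipes(parsed))
```

Because the group boundaries are already known from the parsed structure, no parenthesis counting is needed at this level.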

paulfitz commented 8 years ago

PS thanks for working on this! :-)

joha2 commented 8 years ago

Yes, you are right, the parsing code is duplicated, but this is also due to my lack of experience with Haxe ;-). My idea was that you could maybe simplify the later parsing steps by using this simplified format :-). But since we are already discussing several encoding schemes in #15, maybe this discussion and the pull request are obsolete.

joha2 commented 8 years ago

I restored the main functionality and removed the debugging commands in Parse.hx and Evaluate.hx (accepted remote branch). Further, I re-added my string preprocessing functions to remove $ and /, just in case they are needed ;-). And last but not least, I added the unary operator at the very beginning of the message, just to be consistent (and therefore changed the self-reference part slightly).

paulfitz commented 8 years ago

@joha2 the images you've been adding to the wiki are great. I'd like to propose we store them in #20 rather than in the repo itself, since over time that can make the revision history more bulky than it needs to be. How would you feel about removing wiki_images/COS_entropy_ngram-length.png from this PR and instead using https://cloud.githubusercontent.com/assets/118367/17651672/f7f4788a-6239-11e6-8bc2-ece529eec5c9.png (see https://github.com/paulfitz/cosmicos/issues/20#issuecomment-239694140)?

joha2 commented 8 years ago

I think this is OK, since the images are for the wiki only. Since I am relatively new to GitHub, I did not know this workaround :-) If there is some binary content which is important for the message itself, it should be added to the repo. How do we remove those images from the repo?

paulfitz commented 8 years ago

Don't worry about removing already-added images from the repo for this PR, we can deal with that separately.

joha2 commented 8 years ago

What do you think: Is the python decoder/analyser at the right place in the source tree?

paulfitz commented 8 years ago

I'd have a weak preference for keeping the src directory for the message source. Maybe add a decoding or analysis directory?

joha2 commented 8 years ago

OK

paulfitz / cosmicos

Preprocessing of Strings, Little more Statistics in Python Decoder #16