More string processing functions

CelticMinstrel commented 10 years ago

Waterbear's string processing functions are almost nonexistent. There are no substring functions, no contains, no trim, no find and replace, no capitalization manipulators. I'm sure we could think of many other useful functions it could offer, too (Levenshtein distance?).
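For reference, the Levenshtein distance mentioned above can be computed with a short dynamic-programming routine. This is only a sketch: the function name `levenshtein` is hypothetical and is not taken from the Waterbear code base.

```javascript
// Hypothetical helper, not part of Waterbear: classic dynamic-programming
// Levenshtein distance, keeping only the previous row of the table.
function levenshtein(a, b) {
  // prev[j] is the distance between the first i-1 characters of a
  // and the first j characters of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i characters into "" takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match when cost is 0)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("kitten", "sitting")); // 3
```

Note that this indexes UTF-16 code units, so characters outside the Basic Multilingual Plane count as two positions.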

CelticMinstrel commented 10 years ago

Started adding stuff

I'm a bit torn on string comparison functions; the code behind them is identical to that behind the numeric comparisons, so it would be possible to generalize those to be used for strings as well; on the other hand, there are good reasons to keep them separate. (For example, if kept separate, you can't accidentally compare a string and a number.)
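To illustrate the trade-off in plain JavaScript: one generic comparison works for both types, but it silently coerces when the types are mixed. The names `lessThan` and `stringLessThan` below are hypothetical, not the actual Waterbear block implementations.

```javascript
// Generic comparison: the same code serves numbers and strings.
function lessThan(a, b) {
  return a < b;
}

// Hypothetical string-only variant that refuses mixed types.
function stringLessThan(a, b) {
  if (typeof a !== "string" || typeof b !== "string") {
    throw new TypeError("stringLessThan expects two strings");
  }
  return a < b;
}

console.log(lessThan(2, 10));     // true  (numeric comparison)
console.log(lessThan("2", "10")); // false (lexicographic: "2" sorts after "1")
console.log(lessThan("2", 10));   // true  (the string is coerced to a number)
```

The last call is the accidental string/number comparison that separate blocks would rule out.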

dethe commented 10 years ago

Let's keep them separate for now. For one thing, string inputs quote their literal arguments and number inputs don't.

CelticMinstrel commented 10 years ago

The above branch has rudimentary regex-replace now, but I also started another branch for more advanced regex operations; basically wrapping the "string.replace(pattern, function)" functionality. It's rough around the edges still, and it has also exposed some bugs related to local blocks. Still, do you think this is too complicated, or should I keep working on it?
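For readers unfamiliar with the underlying JavaScript feature, here is a rough sketch of what `string.replace(pattern, function)` passes to its callback. The wrapper `replaceWithDetails` and the shape of the object it hands to `rule` are assumptions made for illustration; they are not the code in the branch.

```javascript
// Hypothetical wrapper: unpacks the callback arguments of
// String.prototype.replace into one object with named fields.
function replaceWithDetails(input, pattern, rule) {
  return input.replace(pattern, (...args) => {
    // With named capture groups, a trailing "groups" object is passed; drop it.
    if (typeof args[args.length - 1] === "object") args.pop();
    const string = args.pop(); // the whole input string
    const offset = args.pop(); // index where this match starts
    const [match, ...submatches] = args;
    return rule({
      match,
      submatches,
      offset,
      string,
      prefix: string.slice(0, offset),             // text before the match
      suffix: string.slice(offset + match.length), // text after the match
    });
  });
}

const out = replaceWithDetails(
  "x=1, y=22",
  /(\w)=(\d+)/g,
  (m) => `${m.submatches[0]} is ${m.submatches[1]}`
);
console.log(out); // "x is 1, y is 22"
```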

dethe commented 10 years ago

I'm in favour of more blocks for now, to see what kind of things we can do with Waterbear. Complicated ones may get hidden by default when we implement tags.

CelticMinstrel commented 10 years ago

Other than better regular expressions, do you have any more suggestions for string processing blocks?

tyhoff commented 10 years ago

The only thing I can think of is a few operations that return booleans, like the Java String functions contains, matches, startsWith, and endsWith.

http://docs.oracle.com/javase/7/docs/api/java/lang/String.html

CelticMinstrel commented 10 years ago

I already added "contains" and "matches"; "startsWith" and "endsWith" would be pretty trivial.
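As a rough idea of how small these are in plain JavaScript, here is a sketch. The function names and signatures are hypothetical; the actual block definitions in the branch may differ.

```javascript
// Hypothetical boolean string helpers; not the actual Waterbear blocks.
function contains(s, sub) {
  return s.indexOf(sub) !== -1;
}

function matches(s, pattern) {
  // Searches anywhere in s. Java's String.matches, by contrast,
  // requires the whole string to match.
  return new RegExp(pattern).test(s);
}

function startsWith(s, prefix) {
  return s.slice(0, prefix.length) === prefix;
}

function endsWith(s, suffix) {
  return suffix.length <= s.length &&
    s.slice(s.length - suffix.length) === suffix;
}

console.log(startsWith("waterbear", "water")); // true
console.log(endsWith("waterbear", "bear"));    // true
```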

CelticMinstrel commented 10 years ago

So, about that regex branch... any comments or suggestions? Particularly the "string replace" context block, which is a bit strange.

dethe commented 10 years ago

I agree that the regex context looks a bit odd. It doesn't actually say it is a string replace: "in string [ ] for each substring matching pattern [ ] according to these rules:". And since the rules don't show in the menu block, the phrase looks incomplete. The auto-numbering of locals is particularly confusing here: I get match 26, offset 26, prefix 26, suffix 26, string 26, and submatch [ ] of match 26, which makes it look like 26 is some kind of important index (at least to me it does).

This is an interesting experiment because we haven't really tested out this kind of specialized context with a lot of locals before. I'm tempted to add it just to see what kind of brainstorming it triggers about better ways of doing this. On the other hand, I find it confusing enough that I'm not at all sure how to use it.

Overall it makes me want to take a step back and reframe the problem. What do regular expressions do? What are they for? What would be the natural way to do those things in Waterbear? Perhaps we create regular expressions underneath for the implementation, but the block view looks nothing like regexes.

The OMeta language is interesting for this, because it combines in one syntax: regular expressions, lexing, tokenizing, parsing, and object-oriented pattern matching. It is used as a way of creating very compact languages and interpreters, but what if we went the other way? Waterbear isn't about compactness, but about readability and discoverability. How could we re-think pattern matching overall to be a better fit for Waterbear, rather than trying to squeeze one specific type of pattern matching (regex) into Waterbear?

CelticMinstrel commented 10 years ago

What are they for? I'd say that regular expressions are for processing strings based on regular patterns.

If you wanted to try adding it to see what sort of ideas it triggers, I'd suggest excluding the replace block and only including the loop-through-all-matches block, which isn't quite as bad in some respects; the replace has the flaw that it won't work as expected if the argument is a literal string.
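A loop over all matches can be sketched in plain JavaScript roughly as follows. The name `forEachMatch` and its callback signature are assumptions for illustration, not the block's actual implementation.

```javascript
// Hypothetical helper: call body(match, offset, submatches) for every
// match of pattern in input.
function forEachMatch(input, pattern, body) {
  // Work on a fresh global copy so the caller's RegExp state is untouched.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags);
  let m;
  while ((m = re.exec(input)) !== null) {
    body(m[0], m.index, m.slice(1));
    // Step forward manually on zero-length matches to avoid looping forever.
    if (m[0].length === 0) re.lastIndex++;
  }
}

forEachMatch("a1b22c333", /\d+/, (match, offset) => {
  console.log(match, "at", offset);
});
// 1 at 1
// 22 at 3
// 333 at 6
```

Because nothing is written back into the input, this form does not depend on where the string came from.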

dethe commented 10 years ago

I guess the other problem is that regex patterns are the very opposite of self-explanatory, which also makes them a bad fit for Waterbear. We already have problems where it is not clear what goes in which block, but having to enter a string which can correctly be parsed as a regex, without any help to speak of, seems like a bad idea.

I had thought at one point that we could have expression blocks which could be assembled into regexes, but I think now that regex patterns are probably too complex for that.

CelticMinstrel commented 10 years ago

Yeah, regex patterns are way too complex for that, especially if we don't have blocks with dynamic socket counts. Regex patterns depend heavily on the invisible concatenation operator, so expressing even a simple regex using blocks in the current Waterbear would take up a lot of space. If you had a single block which took an arbitrary number of inputs and concatenated them all together, it wouldn't be quite so bad, though it'd still be pretty big.
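To make the "single block with an arbitrary number of inputs" idea concrete, here is a sketch in plain JavaScript. The combinators `literal`, `oneOrMore`, `optional`, and `sequence` are hypothetical names invented for this example; they are not existing Waterbear blocks.

```javascript
// Hypothetical pattern-building helpers that each return regex source text.
function literal(text) {
  // Escape characters that have special meaning in a regex.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
function oneOrMore(part) {
  return "(?:" + part + ")+";
}
function optional(part) {
  return "(?:" + part + ")?";
}
// The variadic "concatenate everything" block: any number of inputs.
function sequence(...parts) {
  return parts.join("");
}

const digit = "[0-9]";
const numberSource = sequence(
  optional(literal("-")),
  oneOrMore(digit),
  optional(sequence(literal("."), oneOrMore(digit)))
);
const numberPattern = new RegExp("^" + numberSource + "$");

console.log(numberPattern.test("-12.5")); // true
console.log(numberPattern.test("1."));    // false
```

Even this small pattern needs seven nested calls, which gives a sense of how much space the equivalent blocks would take up.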

If they could be composed from steps in a context, it wouldn't be as bad, but that sort of thing would need a lot of changes to Waterbear, for example introducing the idea of a context that can only hold certain types of blocks.

I personally think it's good to have them around for more advanced users, though.

waterbearlang / waterbear

More string processing functions #496