google-research / dex-lang

Research language for array processing in the Haskell/ML family
BSD 3-Clause "New" or "Revised" License
1.59k stars 107 forks source link

String/Sequence Processing Standard Library #466

Open oxinabox opened 3 years ago

oxinabox commented 3 years ago

I though I would make a list of functions we should have, for working with strings, or more broadly speaking sequences. I think any item from this list would make a good PR/series of PRs for a new contributor.

The key thing about sequences (as I am using the word here) is that they tend to operate not on individual elements of lists but on sequential elements.

A lot of what I suggest we should have could be built-ontop of the parser-combinator standard library. And/or generalized to take Parser Unit rather than a List

I think we should have the following functions:

Containing substrings

These should be of the form "does the first argument [contain/starWith/endwith] the second". So we can write "abcd" `contains` "bc" etc

Split and Join

These are splitting the second sequence at each position the first sequence occurs. And joining the sequence of sequences by inserting the first argument everywhere in between. Haskell calls split, splitOn (and uses split to mean splitting on characters not sequences). Could do that. Haskell also calls join intercalate which seems unnecessarily obtuse; but maybe if we want to save join for relational algebra like uses.

Replacement and Removal

srush commented 3 years ago

Would the right way to implement these be by using concat as well? Trying to understand the best programming model for this kind of thing.

https://github.com/google-research/dex-lang/discussions/471

apaszke commented 3 years ago

Dex strings are very similar to Python strings. They're immutable blocks of characters, and as such x + y necessarily creates a new block of memory and copies x and y over. This is why doing sum(listOfStrings) is generally a bad idea, both in Python and in Dex, because you easily run into a O(n^2) complexity territory. concat is a special (read as ugly) implementation of that idiom with O(n) complexity, but we're hoping that we can have a nice and fast API via withAccum StringAppend ... some time soon. This would be somewhat similar io.StringIO in Python.