Open cdmistman opened 4 years ago
I'm possibly open to this, but I don't think I have the bandwidth to oversee this at the moment. I'm really trying to focus on internal improvements right now.
That makes sense. I've started writing a PR for this but I'm still familiarizing myself with the internals. If anybody has any suggestions, I'm open to input. I'm thinking of using a `Split` to iterate over the `Delim`s, with an internal `Option` to temporarily store a `Text` for the next `next` call if there is a jump of larger than 1 char.
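A minimal sketch of that iterator shape, not the crate's implementation. The names `Piece` and `FullSplit` are hypothetical, and the sketch takes plain `(start, end)` byte spans instead of depending on the regex crate; with the crate, the spans could come from `re.find_iter(haystack).map(|m| (m.start(), m.end()))`. One deliberate difference from the description above: the slot parks the match rather than the text, so that pieces come out in haystack order.

```rust
/// One piece of the haystack: a regex match or the text between matches.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Piece<'h> {
    Delim(&'h str),
    Text(&'h str),
}

/// Hypothetical iterator. `spans` must yield (start, end) byte offsets of
/// non-overlapping matches in ascending order.
pub struct FullSplit<'h, I> {
    haystack: &'h str,
    spans: I,
    last_end: usize,
    pending: Option<Piece<'h>>,
    done: bool,
}

impl<'h, I: Iterator<Item = (usize, usize)>> FullSplit<'h, I> {
    pub fn new(haystack: &'h str, spans: I) -> Self {
        FullSplit { haystack, spans, last_end: 0, pending: None, done: false }
    }
}

impl<'h, I: Iterator<Item = (usize, usize)>> Iterator for FullSplit<'h, I> {
    type Item = Piece<'h>;

    fn next(&mut self) -> Option<Piece<'h>> {
        // A piece parked by the previous call goes out first.
        if let Some(piece) = self.pending.take() {
            return Some(piece);
        }
        if self.done {
            return None;
        }
        let h: &'h str = self.haystack;
        match self.spans.next() {
            Some((start, end)) => {
                let delim = Piece::Delim(&h[start..end]);
                let gap = &h[self.last_end..start];
                self.last_end = end;
                if gap.is_empty() {
                    Some(delim)
                } else {
                    // Emit the text before the match now, the match next time.
                    self.pending = Some(delim);
                    Some(Piece::Text(gap))
                }
            }
            None => {
                // No more matches: emit whatever trails the last one, once.
                self.done = true;
                let tail = &h[self.last_end..];
                if tail.is_empty() { None } else { Some(Piece::Text(tail)) }
            }
        }
    }
}
```

This version skips empty `Text` pieces, unlike the Python example later in the thread, which keeps a leading empty string. Whether to keep them is a design choice the sketch does not settle.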
I'd like to leave a use case. Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation. It's easier to define what looks like a "word" than the other, so in Python I can use a regex split on the thing I actually want to keep and put capturing parens so that it is included in the results:
>>> import re
>>> splitter = re.compile("([a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?)")
>>> splitter.split('He said, "I\'d like to eat cake!"')
['', 'He', ' ', 'said', ', "', "I'd", ' ', 'like', ' ', 'to', ' ', 'eat', ' ', 'cake', '!"']
After splitting, I can, for instance, modify the "word" parts however I like and then reconstitute the original string by joining all the parts on the empty string.
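The same round trip can be sketched in Rust with the standard library only. This is an illustration under a loud assumption: a "word" here is any maximal run of ASCII letters and apostrophes, which is looser than the Python pattern above, and no regex engine is involved. The helper names `words_and_gaps` and `shout_words` are hypothetical.

```rust
/// Splits `s` into alternating runs. The bool is true for "word" runs
/// (ASCII letters and apostrophes, a cruder rule than the Python regex).
fn words_and_gaps(s: &str) -> Vec<(bool, &str)> {
    let is_word = |c: char| c.is_ascii_alphabetic() || c == '\'';
    let mut out = Vec::new();
    let mut start = 0;
    let mut current: Option<bool> = None;
    for (i, c) in s.char_indices() {
        let kind = is_word(c);
        match current {
            // The run kind flipped: close the previous run at byte offset i.
            Some(k) if k != kind => {
                out.push((k, &s[start..i]));
                start = i;
                current = Some(kind);
            }
            None => current = Some(kind),
            _ => {}
        }
    }
    if let Some(k) = current {
        out.push((k, &s[start..]));
    }
    out
}

/// Uppercases the word pieces and joins everything back together.
fn shout_words(s: &str) -> String {
    words_and_gaps(s)
        .into_iter()
        .map(|(is_word, piece)| {
            if is_word { piece.to_uppercase() } else { piece.to_string() }
        })
        .collect()
}
```

Because every byte of the input lands in exactly one piece, concatenating the unmodified pieces gives back the original string, which is the property the use case relies on.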
Any chance this might move forward?
Starting from Rust 1.51, there is a stable `split_inclusive` method on strings that can work with arbitrary patterns. At least on nightly, you should be able to use the `pattern` feature of the `regex` crate and pass `Regex` instances into `str.split_inclusive()` calls.
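For reference, a small sketch of `split_inclusive` on stable Rust, using a closure pattern rather than a `Regex` (passing a `Regex` depends on the nightly-only route mentioned above). The function name `sentences` is hypothetical. Note that each delimiter stays attached to the end of the piece before it instead of being yielded as a separate item, so this is close to, but not the same as, the proposed `full_split`.

```rust
/// Splits after each '.' or '!' and keeps the terminator on its piece.
fn sentences(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| c == '.' || c == '!').collect()
}
```

For example, `sentences("Hi. Bye! ok")` yields `["Hi.", " Bye!", " ok"]`.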
> Any chance this might move forward?
As I said above, right now my focus is on internals. I don't have the bandwidth to mentor this. With that said, adding such an API might not require much mentorship. It's possible that if someone submits a PR, I might be able to get it merged if doing this is as "simple" as I think it is.
> Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation.
FWIW, I believe your use case would probably be better solved by Unicode word segmentation. The `unicode-segmentation` crate has exactly what you want I think. The `bstr` crate also has a `words_with_breaks` method that works on `&[u8]`. It is even implemented with a regex! Although, it does not use the regex crate.
Well, I'm so glad I asked. The unicode-segmentation was exactly what I needed, so thanks!
@shner-elmo I think if you want to work on it then go ahead! You'll probably want to implement the core logic in the `regex-automata::meta` module, and define it as a method on `meta::Regex`: https://docs.rs/regex-automata/latest/regex_automata/meta/struct.Regex.html
I don't have this issue paged into context at the moment, so I'm not sure if there are any gotchas to look out for.
@BurntSushi Thanks for the encouragement! I created a pull request and I'm looking forward to hearing your thoughts about it.
This issue has been brought up before in #285 and #330 but I think it might be worth revisiting.
I think it might be useful to introduce a `full_split` method on a `Regex`. This would behave similarly to the current `split` method, but would also return values that match the regex. The iterator would return an enum for every iteration, either a `Delim` (match) or a `Text` (non-match).
This could have a few helpful applications. In #330 the author suggested they were using it in some kind of calculator. Personally, I would use this for tokenizing. In the same issue, there was a suggested fix, but I think it might be helpful to include it in this crate officially.
I've based the names on OCaml's own regex API (seen here).