Introduce split_inclusive to API

rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

https://docs.rs/regex

Apache License 2.0


Introduce split_inclusive to API #681

Open cdmistman opened 4 years ago

cdmistman commented 4 years ago

This issue has been brought up before in #285 and #330 but I think it might be worth revisiting.

I think it might be useful to introduce a full_split method on Regex. This would behave similarly to the current split method, but would also return the substrings that match the regex. The iterator would yield an enum on every iteration: either a Delim (a match) or a Text (a non-match).
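A minimal sketch of the shape being proposed, written against the standard library only so it does not depend on any particular regex API. The `Piece` enum and the `full_split` function are hypothetical names for illustration, not part of the regex crate. The matches are passed in as `(start, end)` byte offsets, which is what `Regex::find_iter` would provide via `m.start()` and `m.end()`.

```rust
/// One piece of the input: a delimiter match, or the text between matches.
/// Variant names follow the proposal; this type is hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Piece<'a> {
    Delim(&'a str),
    Text(&'a str),
}

/// Hypothetical helper: split `haystack` into matches and non-matches.
/// `matches` must yield non-overlapping `(start, end)` byte offsets in
/// ascending order, each on a char boundary.
pub fn full_split<'a, I>(haystack: &'a str, matches: I) -> Vec<Piece<'a>>
where
    I: IntoIterator<Item = (usize, usize)>,
{
    let mut pieces = Vec::new();
    let mut last = 0; // end offset of the previous match
    for (start, end) in matches {
        // Emit the text between the previous match and this one, if any.
        if start > last {
            pieces.push(Piece::Text(&haystack[last..start]));
        }
        pieces.push(Piece::Delim(&haystack[start..end]));
        last = end;
    }
    // Emit whatever follows the final match.
    if last < haystack.len() {
        pieces.push(Piece::Text(&haystack[last..]));
    }
    pieces
}
```

One design choice here is an assumption: empty Text pieces between adjacent matches are skipped, whereas the existing split method yields empty strings in that position.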

This could have a few helpful applications. In #330 the author suggested they were using it in some kind of calculator. Personally, I would use this for tokenizing. In the same issue, there was a suggested fix, but I think it might be helpful to include it into this crate officially.

I've based the names on OCaml's own regex api (seen here)

BurntSushi commented 4 years ago

I'm possibly open to this, but I don't think I have the bandwidth to oversee this at the moment. I'm really trying to focus on internal improvements right now.

cdmistman commented 4 years ago

That makes sense. I've started writing a PR for this but I'm still familiarizing myself with the internals. If anybody has any suggestions, I'm open to input. I'm thinking of using a Split to iterate over the Delims, with an internal Option that temporarily stores a Text for the following call to next() whenever there is a gap of more than one char between matches.
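The buffering idea can be sketched as a lazy iterator that holds one pending item in an `Option`. This is only an illustration under stated assumptions: `FullSplit` and `Piece` are hypothetical names, the match source is any iterator of `(start, end)` byte offsets rather than the crate's own Split type, and the sketch buffers the delimiter instead of the text, because the text precedes the delimiter in the output.

```rust
/// Hypothetical item type: a delimiter match or the text between matches.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Piece<'a> {
    Delim(&'a str),
    Text(&'a str),
}

/// Hypothetical lazy iterator over matches and non-matches.
pub struct FullSplit<'a, I> {
    haystack: &'a str,
    matches: I,                 // yields (start, end) byte offsets, ascending
    last: usize,                // end offset of the previous match
    pending: Option<Piece<'a>>, // delimiter held back for the following call
    done: bool,                 // set once the match source is exhausted
}

impl<'a, I: Iterator<Item = (usize, usize)>> FullSplit<'a, I> {
    pub fn new(haystack: &'a str, matches: I) -> Self {
        FullSplit { haystack, matches, last: 0, pending: None, done: false }
    }
}

impl<'a, I: Iterator<Item = (usize, usize)>> Iterator for FullSplit<'a, I> {
    type Item = Piece<'a>;

    fn next(&mut self) -> Option<Piece<'a>> {
        // Return a buffered delimiter before looking for another match.
        if let Some(piece) = self.pending.take() {
            return Some(piece);
        }
        if self.done {
            return None;
        }
        let haystack = self.haystack;
        match self.matches.next() {
            Some((start, end)) => {
                let gap = &haystack[self.last..start];
                let delim = Piece::Delim(&haystack[start..end]);
                self.last = end;
                if gap.is_empty() {
                    Some(delim)
                } else {
                    // Text comes first; keep the delimiter for the next call.
                    self.pending = Some(delim);
                    Some(Piece::Text(gap))
                }
            }
            None => {
                self.done = true;
                let tail = &haystack[self.last..];
                if tail.is_empty() { None } else { Some(Piece::Text(tail)) }
            }
        }
    }
}
```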

kyclark commented 3 years ago

I'd like to leave a use case. Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation. It's easier to define what looks like a "word" than the other, so in Python I can use a regex split on the thing I actually want to keep and put capturing parens so that it is included in the results:

>>> import re
>>> splitter = re.compile("([a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?)")
>>> splitter.split('He said, "I\'d like to eat cake!"')
['', 'He', ' ', 'said', ', "', "I'd", ' ', 'like', ' ', 'to', ' ', 'eat', ' ', 'cake', '!"']

After splitting, I can, for instance, modify the "word" parts however I like and then reconstitute the original string by joining all the parts on the empty string.

Any chance this might move forward?

RReverser commented 3 years ago

Starting from Rust 1.51, there is a stable split_inclusive method on strings that works with arbitrary patterns. At least on nightly, you should be able to enable the pattern feature of the regex crate and pass Regex instances into str.split_inclusive() calls.
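For reference, a small sketch of `str::split_inclusive` with a closure pattern, which works on stable Rust without the regex crate. The helper name `split_keep_terminators` is made up for this example. Passing a `Regex` as the pattern is not shown, since that relies on the crate's nightly-only pattern feature mentioned above.

```rust
/// Hypothetical helper: split after every comma or semicolon, keeping the
/// delimiter attached to the end of the preceding chunk.
pub fn split_keep_terminators(s: &str) -> Vec<&str> {
    s.split_inclusive(|c: char| c == ',' || c == ';').collect()
}
```

Note that split_inclusive leaves each delimiter attached to the chunk before it, so it does not separate matches from non-matches the way the proposed full_split would.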

BurntSushi commented 3 years ago

> Any chance this might move forward?

As I said above, right now my focus is on internals. I don't have the bandwidth to mentor this. With that said, adding such an API might not require much mentorship. It's possible that if someone submits a PR, I might be able to get it merged if doing this is as "simple" as I think it is.

> Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation.

FWIW, I believe your use case would probably be better solved by Unicode word segmentation. The unicode-segmentation crate has exactly what you want I think. The bstr crate also has a words_with_breaks method that works on &[u8]. It is even implemented with a regex! Although, it does not use the regex crate.

kyclark commented 3 years ago

Well, I'm so glad I asked. The unicode-segmentation crate was exactly what I needed, so thanks!

BurntSushi commented 1 year ago

@shner-elmo I think if you want to work on it then go ahead! You'll probably want to implement the core logic in the regex-automata::meta module, and define it as a method on meta::Regex: https://docs.rs/regex-automata/latest/regex_automata/meta/struct.Regex.html

I don't have this issue paged into context at the moment, so I'm not sure if there are any gotchas to look out for.

shner-elmo commented 1 year ago

@BurntSushi Thanks for the encouragement! I created a pull request and I'm looking forward to hearing your thoughts about it.