martinvonz / jj

A Git-compatible VCS that is both simple and powerful
https://martinvonz.github.io/jj/
Apache License 2.0
7.24k stars 240 forks source link

str_util: support case‐insensitive string patterns #3991

Open emilazy opened 3 days ago

emilazy commented 3 days ago

See the commit message for details. This resolves an old TODO and came up during the rework of https://github.com/martinvonz/jj/pull/3951. I’m not totally sure about the defaults here, but the functionality seems good to have.

Checklist

If applicable:

yuja commented 3 days ago

This is a function argument rather than part of the pattern syntax as it doesn’t make sense for all uses of patterns to support case‐sensitivity

I think it should be a pattern kind so that the same syntax can be used in jj branch list PATTERN command for example. It's weird if these commands took a separate --case-insensitive flag.

(e.g. Git branch settings, where the remote end has no support for case‐insensitive matching),

Perhaps, .to_glob() can be renamed to .to_case_sensitive_glob() and error out if the type was Glob { .., case_insensitive: true }?

and it’s not clear how it would best be integrated into the pattern syntax anyway. The glob and regex crates also separate matching options like this from the pattern itself.

Something like iglob: or icase-glob: seems okay (though I'm not a big fan of these.) For regex:, we won't need iregex:, but could be added for consistency.

@arxanas how do you deal with case-insensitive matches in git-branchless? Just using regex: pattern? https://github.com/arxanas/git-branchless/wiki/Reference:-Revsets#patterns

Case‐insensitivity is on by default; a casual search for commit names or descriptions is unlikely to care about specific case distinctions, the domain part of email addresses is de jure case‐insensitive, and the local‐part is case‐insensitive in practice too. This could be changed at a later date if it turns out to be undesirable default for any of these functions (description("FIXME")?).

Good point about email addresses, but I would make it case-sensitive by default mainly for consistency with command arguments and file patterns.

This currently only handles ASCII case folding, due to the complexities of case‐insensitive Unicode comparison and the glob crate’s lack of support for it.

Sounds good to me.

Thanks!

emilazy commented 3 days ago

Thank you for the review!

I think it should be a pattern kind so that the same syntax can be used in jj branch list PATTERN command for example. It's weird if these commands took a separate --case-insensitive flag.

I was wavering on whether it should be part of the pattern or not, so I can switch back. Although I’m not sure how much use there is for searching branch names case‐insensitively, especially when Git remotes don’t support it. I didn’t bother adding it to the corresponding revset functions for that reason; I didn’t want to make people expect they could treat branch names case‐insensitively in all contexts. It would be nice for it to work in contexts where it does make sense without needing an explicit extension to the interface, though, I agree.

Perhaps, .to_glob() can be renamed to .to_case_sensitive_glob() and error out if the type was Glob { .., case_insensitive: true }?

Sure, that works. It feels kind of awkward for some types of pattern to be rejected in some contexts, but I guess it’ll be inevitable for those code paths to error out if/when regex patterns are supported anyway.

Something like iglob: or icase-glob: seems okay (though I'm not a big fan of these.) For regex:, we won't need iregex:, but could be added for consistency.

Taking inspiration from regex syntax (a dangerous phrase!), what about…

exact/i:foo
substring/i:foo
glob/i:foo

# Short for `substring/i:foo`
/i:foo

Too weird? I like that it clearly separates out the type from the common options, and I think that we probably want something shorter than substring/i: for quick commit searches if it’s not going to be the default, but I admit that /i:foo makes me think of DOS. Maybe there’s something in the region of this we can bikeshed. substring(i):…? I guess we could just make i: short for isubstring:.

Good point about email addresses, but I would make it case-sensitive by default mainly for consistency with command arguments and file patterns.

Makes sense; it makes this less of a backwards‐incompatible change and can always be adjusted later. How about keeping mine() case‐insensitive, since it’s always dealing with email addresses? Not that it really matters, since it’s not likely common to have commits with email addresses differing only in case (though you can find it in Nixpkgs, like every other kind of metadata strangeness in the world).

yuja commented 3 days ago

Taking inspiration from regex syntax (a dangerous phrase!), what about…

exact/i:foo
substring/i:foo
glob/i:foo

# Short for `substring/i:foo`
/i:foo

Too weird? I like that it clearly separates out the type from the common options, and I think that we probably want something shorter than substring/i: for quick commit searches if it’s not going to be the default, but I admit that /i:foo makes me think of DOS. Maybe there’s something in the region of this we can bikeshed. substring(i):…? I guess we could just make i: short for isubstring:.

It's unclear to me where we set a boundary (e.g. cwd-glob: vs glob/cwd: in file pattern), so I wouldn't introduce /{option} syntax. That said, I don't have strong feeling. If people like /i, I won't be against.

i: for isubstring: sounds good to me. If we add a short form, the canonical form could be icase-substring:.

How about keeping mine() case‐insensitive, since it’s always dealing with email addresses? Not that it really matters, since it’s not likely common to have commits with email addresses differing only in case (though you can find it in Nixpkgs, like every other kind of metadata strangeness in the world).

That makes sense to me. Inline comment will help understand why it's case insensitive.

emilazy commented 2 days ago

I’ve switched this over to being part of string patterns, and settled on {kind}-i:…/i:… syntax for now to keep the option clearly separated from the base kind without adding any new weird symbols. It was a little gnarlier than expected because of the default pattern kind being dependent on context (which I didn’t realize before and don’t entirely love, but it is what it is). I’d appreciate it if you could take another look!

martinvonz commented 2 days ago

mine() matches case‐insensitively unconditionally, since email addresses are conventionally case‐insensitive and it doesn’t take a pattern anyway.

Is that already the case or do you mean that we should make it work that way? (It sounds like you're saying the former, but it doesn't look like that's what the code does.)

Ah, you mean that it behaves that way after the commit. We don't really have a convention for what present tense means in commit messages. I use present tense to describe how it behaves before the patch (not saying you have to follow that, just explaining why I was confused).

emilazy commented 2 days ago

I agree that it’s confusing; I’ll reword it. I try to stick to the Git quasi‐standard of imperative present tense, but it can get so awkward when going into detail, so I frequently slip…

(Thank you for the repository invite, by the way!)