Closed alexvonme closed 5 months ago
Hi guys. The regexp might be difficult to fix on your own.
What about reporting a bug to the vale Google rules, pointing out the issues you found with them?
Found this
https://github.com/splunk/vale-splunk-style-guide/blob/main/styles%2FSplunk%2FLatin.yml
Please have a look at this vale repository
Yes, I saw that one, but it's not necessary in our case. SUSE doesn't seem to care much for most of them and seems to explicitly allow via. We only need to fix i.e. and e.g.
@ccoVeille How about this:
'\b(?:eg|e\.g\.|i\.[\s]e\.)[\s]': 'for example '
'\b(?:eg|e\.g\.|i\.[\s]e\.)[?:,|.|;]': 'for example'
'\b(?:ie|i\.e\.|i\.[\s]e\.)[\s]': 'that is '
'\b(?:ie|i\.e\.|i\.[\s]e\.)[?:,|.|;]': 'that is'
It could be shorter, yes. However, this is easy to read and configure, while catching all possible errors.
The only false positive problem is the .ie TLD, which we'll need to add an exception for. Other than that, I'm not worried about catching IE as an abbreviation of Internet Explorer, since its obscurity ensures most people nowadays refer to it by its full name rather than its abbreviation.
Your regexp can be simplified
Anything within [] is a list of characters that could match, so the brackets are not needed when there is only one character inside. So [\s] = \s
Then
(?:ie|i\.e\.|i\.[\s]e\.)[\s]
Means this
(?:ie|i\.e\.|i\.\se\.)\s
Then i\.e\.|i\.\se\. means i.e. with or without a space
So it's
(?:ie|i\.\s?e\.)\s
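A quick way to check that a simplification like this keeps the same behavior is to run both forms over a few inputs. The following is a minimal sketch using Python's re module as a stand-in; vale uses a different regexp engine, so edge cases may differ, and the sample strings are made up for illustration rather than taken from the test files.

```python
import re

# The pattern as originally proposed, and the simplified form derived above.
original = re.compile(r"\b(?:ie|i\.e\.|i\.[\s]e\.)[\s]")
simplified = re.compile(r"\b(?:ie|i\.\s?e\.)\s")

# Hypothetical sample inputs (not from Latin-good.txt or Latin-bad.txt).
samples = ["ie foo", "i.e. foo", "i. e. foo", "i.e.foo", "pie foo", "i.e foo"]


def spans(pattern, text):
    """Return the matched substrings so both patterns can be compared."""
    return [m.group(0) for m in pattern.finditer(text)]


# Both patterns should flag exactly the same text in every sample.
for text in samples:
    assert spans(original, text) == spans(simplified, text), text
```

If the loop finishes without an AssertionError, the two forms agree on these samples. That is evidence, not proof, of equivalence.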
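Then about
'\b(?:eg|e\.g\.|i\.[\s]e\.)[?:,|.|;]':
I have a problem with [?:,|.|;]. It's a [], not a (), so ?: and | are not operators there; they are literal characters that the class would match. The dot is also literal inside [], so escaping it is optional but harmless.
So it should be [,\.;]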
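The character-class point can be seen in a short sketch. It uses Python's re module as a stand-in for vale's engine; character classes behave the same way in the common regex dialects, but treat that as an assumption here.

```python
import re

broken = re.compile(r"[?:,|.|;]")  # inside [], every character is literal
intended = re.compile(r"[,\.;]")   # the class suggested above

# '?', ':' and '|' are matched by the broken class, which is not what was meant.
for ch in "?:|":
    assert broken.fullmatch(ch) is not None
    assert intended.fullmatch(ch) is None

# Both classes match the three punctuation marks that were actually wanted.
for ch in ",.;":
    assert broken.fullmatch(ch) is not None
    assert intended.fullmatch(ch) is not None

# A dot inside a class only matches a literal dot, escaped or not.
assert re.fullmatch(r"[.]", "a") is None
assert re.fullmatch(r"[\.]", ".") is not None
```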
Here is what I would suggest
'\b(?:eg|e\.\s?g\.)\s+(?![,\.;])': 'for example '
'\b(?:eg|e\.\s?g\.)\s*[,\.;]': 'for example'
'\b(?:ie|i\.\s?e\.)\s+(?![,\.;])': 'that is '
'\b(?:ie|i\.\s?e\.)\s*[,\.;]': 'that is'
Regexps are such a pain, I know.
Please test them
My issue with your regexp was this:
eg ; foo bar
would have been replaced with for example ; foo bar
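The overlap can be reproduced with a small sketch. It again uses Python's re module as a stand-in, which supports lookahead; whether vale's engine behaves identically is an assumption. Only the e.g. rules are shown.

```python
import re

# First proposal: any whitespace after the abbreviation triggers the rule
# whose replacement ends with a space.
first_try = re.compile(r"\b(?:eg|e\.g\.|i\.[\s]e\.)[\s]")

# Revised pair: the lookahead keeps the "before a word" rule from firing
# when punctuation follows the whitespace.
before_word = re.compile(r"\b(?:eg|e\.\s?g\.)\s+(?![,\.;])")
before_punct = re.compile(r"\b(?:eg|e\.\s?g\.)\s*[,\.;]")

text = "eg ; foo bar"

# The first proposal matches "eg " and would leave the stray " ;" behind.
assert first_try.search(text).group(0) == "eg "

# With the revised pair, only the punctuation rule matches this input.
assert before_word.search(text) is None
assert before_punct.search(text).group(0) == "eg ;"

# Ordinary usage is still caught by the "before a word" rule alone.
assert before_word.search("eg foo bar").group(0) == "eg "
assert before_punct.search("eg foo bar") is None
```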
Please have a look here if you want to understand
Please note I edited my suggested regexp in my previous post
I now realize I'm unsure whether the negative lookahead I'm suggesting is supported by the standard Go regexp package.
In other words, the regexp I'm suggesting might be invalid.
I'll have to check the vale code and maybe suggest using a library that supports it, but that may require some changes in vale.
I checked: vale uses a fork of the regexp2 library.
So it should be OK.
Please test and let me know.
I asked the maintainers if they could bump the library to get new features that we don't need right now, but that could be useful later.
@ccoVeille I tested it around and understood the overlap your last edit fixed. Thanks, as always. Just the TLD exception left.
@tbazant, sorry if this became too complicated.
Regexps are not easy when you need to catch something. Either the regexp is simple and misses the cases no one thought about, or it is complicated but works.
I've added Latin test files. Latin-good.txt does not pass with the current setup.
@tbazant I'd argue that we only really need the first 2 lines of Latin-good.txt.
An alternative Latin-good.txt might be:
www.example.ie
https://example.eg/faq
(I would like to note here that my deep appreciation for Egypt and Ireland is the reason I'm trying to add the exception)
@ccoVeille Do you have an idea on how to write the exception? I've been trying with multiple regex rules that have all failed to exclude the domains.
I suggested a change, please let me know
https://github.com/openSUSE/suse-vale-styleguide/pull/103/files#r1653107808
I've applied the latest suggestion, but all sentences in Latin-good.txt are still matched. BTW, if we keep the rule that the bad test files have exactly 1 error on each line, then
vale --config EXAMPLE.vale.ini --output line tests/Latin-bad.txt | wc -l
has to match
cat tests/Latin-bad.txt | wc -l
Regexps are tough, really.
Splitting the regexp across multiple lines might help.
For example, move the ie on one line and the i.e. on another, for testing purposes. You will see what matches and what doesn't.
Maybe the negative lookbehind breaks everything for all the regexps.
I'm still on a phone, cannot help yet
I replaced all the complex i.e. rules with
'\b(?:i\.\s?e\.)': 'that is' # i.e. with possible whitespaces inbetween
'\b(?:ie\.\s+)': 'that is' # ie. followed by whitespaces
'\b(?:ie\.(?=[;,]))': 'that is' # ie. followed by ; or ,
'\b(?:ie[\t\f\r ])': 'that is' # ie followed by whitespaces but not newline
'\b(?:ie(?=[;,]))': 'that is' # ie followed by ; or ,
and set ignorecase: false
which makes all tests pass. Should I update the PR's suggestions with my findings?
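To try rules like these outside vale, they can be run over sample lines one by one. This is a minimal sketch with Python's re module, whose default case-sensitive matching stands in for ignorecase: false. The flagged helper is hypothetical, and vale's engine may differ in details.

```python
import re

# The five i.e. rules from the comment above, compiled case-sensitively.
RULES = [
    r"\b(?:i\.\s?e\.)",      # i.e. with optional whitespace in between
    r"\b(?:ie\.\s+)",        # ie. followed by whitespace
    r"\b(?:ie\.(?=[;,]))",   # ie. followed by ; or ,
    r"\b(?:ie[\t\f\r ])",    # ie followed by whitespace other than newline
    r"\b(?:ie(?=[;,]))",     # ie followed by ; or ,
]
PATTERNS = [re.compile(rule) for rule in RULES]


def flagged(line):
    """Return every substring that any of the rules would flag in the line."""
    return [m.group(0) for p in PATTERNS for m in p.finditer(line)]


# The two domain lines proposed for Latin-good.txt are not flagged.
assert flagged("www.example.ie") == []
assert flagged("https://example.eg/faq") == []

# The abbreviation itself is flagged in its common spellings.
assert flagged("i.e. daffodils") == ["i.e."]
assert flagged("ie, daffodils") == ["ie"]
```

One thing this sketch suggests checking in vale itself: a .ie domain followed by a space mid-sentence, such as "visit www.example.ie today", still matches the fourth rule here.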
We can't set ignorecase to false. A lot of people literally start their sentences with IE.
If the reason you set ignorecase to false is to exempt IE for Internet Explorer, I really don't think it's necessary. I haven't seen anyone refer to it as IE in the last decade. It's not used anymore, so people don't use its abbreviation.
Maybe you could use ignorecase: false and define all the variations you need, but exclude capital IE:
ignorecase: false
map:
'\b(?:[iI]\.\s?[eE]\.)': 'that is' # i.e. with possible whitespaces inbetween
'\b(?:[iI][eE]\.\s+)': 'that is' # ie. followed by whitespaces
'\b(?:[iI][eE]\.(?=[;,]))': 'that is' # ie. followed by ; or ,
'\b(?:ie[\t\f\r ])': 'that is' # ie followed by whitespaces but not newline, here IE is excluded
'\b(?:[iI][eE](?=[;,]))': 'that is' # ie followed by ; or ,
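The effect of spelling out the case variants by hand can be checked the same way. This is a sketch in Python's re module, not vale's engine; the flagged_mixed_case helper and the sample sentences are made up for illustration.

```python
import re

# The case-explicit rules from the map above, matched case-sensitively.
MIXED_CASE_RULES = [
    r"\b(?:[iI]\.\s?[eE]\.)",
    r"\b(?:[iI][eE]\.\s+)",
    r"\b(?:[iI][eE]\.(?=[;,]))",
    r"\b(?:ie[\t\f\r ])",  # lowercase only, so "IE " is left alone
    r"\b(?:[iI][eE](?=[;,]))",
]
MIXED_CASE_PATTERNS = [re.compile(rule) for rule in MIXED_CASE_RULES]


def flagged_mixed_case(line):
    """Return every substring flagged by the case-explicit rules."""
    return [m.group(0) for p in MIXED_CASE_PATTERNS for m in p.finditer(line)]


# Capital IE followed by a space is not flagged.
assert flagged_mixed_case("IE 11 is retired") == []

# Sentence-initial and lowercase abbreviations are still flagged.
assert flagged_mixed_case("I.E. daffodils") == ["I.E."]
assert flagged_mixed_case("Ie. daffodils") == ["Ie. "]
assert flagged_mixed_case("ie daffodils") == ["ie "]
```

In this sketch, capital IE is only exempt before a space: "IE, the browser" still matches the last rule.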
Regexps are so fun to maintain
Otherwise, I'm OK with @tbazant keeping ignorecase: true, and people who use IE would be unhappy :smile: Users can use a vale comment to disable the rule, rather than us trying to maintain a complicated regex.
It's open for debate
I made one last modification and enhanced the tests a bit. Please review and tell me whether to merge, it's getting long :-)
I fully agree with your need to finally be done with this one. Though, I think I made the last change. If you're okay with it, please merge.
The Latin corrections ended up correcting i.e. daffodils to that isdaffodils, so I fixed the spacing error. The commas don't need any extra edits, since the SurroundingCommas.yml rule already corrects for them.