Fix Spacing in Latin.yml

alexvonme commented 5 months ago

The Latin corrections ended up correct i.e. daffodils to that isdaffodils so I fixed the spacing error.

The commas don't need any extra edits, since the SurroundingCommas.yml rule already corrects for them.

ccoVeille commented 5 months ago

Hi guys. The regexp might be difficult to fox by your own.

What about reporting a bug issue to vale Google rules, pointing the issues you found with them

ccoVeille commented 5 months ago

Found this

https://github.com/splunk/vale-splunk-style-guide/blob/main/styles%2FSplunk%2FLatin.yml

Please have a look at this vale repository

alexvonme commented 5 months ago

Found this

https://github.com/splunk/vale-splunk-style-guide/blob/main/styles%2FSplunk%2FLatin.yml

Please have a look at this vale repository

Yes, I saw that one, but it's not necessary in our case. SUSE doesn't seem to care much for most of them and seems to explicitly allow via. We only need to fix i.e. and e.g.

alexvonme commented 5 months ago

@ccoVeille How about this:

  '\b(?:eg|e\.g\.|i\.[\s]e\.)[\s]': 'for example '
  '\b(?:eg|e\.g\.|i\.[\s]e\.)[?:,|.|;]': 'for example'
  '\b(?:ie|i\.e\.|i\.[\s]e\.)[\s]': 'that is '
  '\b(?:ie|i\.e\.|i\.[\s]e\.)[?:,|.|;]': 'that is'

It could be shorter, yes. However, this is easy to read and configure, while catching all possible errors.

The only false positive problem is the .ie TLD, which we'll need to add an exception for. Other than that, I'm not worried about catching IE as an abbreviation of Internet Explorer, since its obscurity ensures most people nowadays refer to it by its full name rather than its abbreviation.

ccoVeille commented 5 months ago

Your regexp can be simplified

'\b(?:eg|e\.g\.|i\.[\s]e\.)[\s]': 'for example ' '\b(?:eg|e\.g\.|i\.[\s]e\.)[?:,|.|;]': 'for example' '\b(?:ie|i\.e\.|i\.[\s]e\.)[\s]': 'that is ' '\b(?:ie|i\.e\.|i\.[\s]e\.)[?:,|.|;]': 'that is'

Something within [] is a list character that could match. So it's not needed when only one character is in it. So [\s] = \s

Then

(?:ie|i\.e\.|i\.[\s]e\.)[\s]

Means this

(?:ie|i\.e\.|i\.\se\.)\s

Then i\.e\.|i\.\se\. means i.e. with or without a space

So it's

(?:ie|i\.\s?e\.)\s

Then about

'\b(?:eg|e.g.|i.[\s]e.)[?:,|.|;]':

I have a problem with [?:,|.|;] it's a [] not a (), so there is no ?: or | operators. And dot must be escaped.

So it should be [,\.;]

ccoVeille commented 5 months ago

Here is what I would suggest

'\b(?:eg|e\.\s?g\.)\s+(?![,\.;])': 'for example '
'\b(?:eg|e\.\s?g\.)\s*[,\.;]': 'for example'
'\b(?:ie|i\.\s?e\.)\s+(?![,\.;])': 'that is '
'\b(?:ie|i\.\s?e\.)\s*[,\.;]': 'that is'

Regexp are such a pain, I know.

Please test them

My issue with your regexp was this

eg ; foo bar would have been replaced to for example ; foo bar

ccoVeille commented 5 months ago

Please have a look here if you wsjt to understand

https://regexr.com/82036

ccoVeille commented 5 months ago

Please note I edited my suggested regexp in my previous post

ccoVeille commented 5 months ago

I now realize. I'm unsure that negative look ahead I'm suggesting to use is supported in standard Go regexp package.

Said otherwise, the regexp I'm suggesting, might be invalid/inaccurate.

I'll have to check vale code, and suggest using a library to support it, but it may require some changes in vale code.

ccoVeille commented 5 months ago

I checked vale use a fork of a regexp2 lib.

So it should be OK.

Please test and let me know.

I asked maintainers if they could bump the library to get new feature we don't need right now, but that could be used later

alexvonme commented 5 months ago

@ccoVeille I tested it around and understood the overlap your last edit fixed. Thanks, as always. Just the TLD exception left.

@tbazant, sorry if this became too complicated.

ccoVeille commented 5 months ago

Regexp are uneasy. If you want and need to catch something. The regexp is either simple and it doesn't catch everything needed, and the cases no one thought about, are not working, or the regexp is complicated, but it works

tbazant commented 5 months ago

I've added Latin test files, Latin-good.txt does not pass with the current setup

alexvonme commented 5 months ago

The only false positive problem is the .ie TLD, which we'll need to add an exception for. Other than that, I'm not worried about catching IE as an abbreviation of Internet Explorer, since its obscurity ensures most people nowadays refer to it by its full name rather than its abbreviation.

@tbazant I'd argue that we only really need the first 2 lines of Latin-good.txt.

alexvonme commented 5 months ago

An alternative Latin-good.txt might be:

www.example.ie
https://example.eg/faq

alexvonme commented 5 months ago

(I would like to note here that my deep appreciation for Egypt and Ireland is the reason I'm trying to add the exception)

@ccoVeille Do you have an idea on how to write the exception? I've been trying with multiple regex rules that have all failed to exclude the domains.

ccoVeille commented 5 months ago

(I would like to note here that my deep appreciation for Egypt and Ireland is the reason I'm trying to add the exception)

@ccoVeille Do you have an idea on how to write the exception? I've been trying with multiple regex rules that have all failed to exclude the domains.

I suggested a change, please let me know

https://github.com/openSUSE/suse-vale-styleguide/pull/103/files#r1653107808

tbazant commented 5 months ago

I've applied the latest suggestion but still all sentences in Latin-good.txt are matched. BTW if we keep the rule that the bad test files have exactly 1 error on each line, then

vale --config EXAMPLE.vale.ini --output line tests/Latin-bad.txt | wc -l

has to match

cat tests/Latin-bad.txt | wc -l

ccoVeille commented 5 months ago

Regexp are thought, really.

Splitting the regexp in multi lime might help

Like moving the ie on one line, and the i.e. on another, for test purpose. You will see what match and what doesn't.

Maybe the negative look behind break everything for all the regexp.

I'm still on a phone, cannot help yet

tbazant commented 5 months ago

i replaced all the complex i.e rules with

  '\b(?:i\.\s?e\.)': 'that is' # i.e. with possible whitespaces inbetween
  '\b(?:ie\.\s+)': 'that is' # ie. followed by whitespaces
  '\b(?:ie\.(?=[;,]))': 'that is' # ie. followed by ; or ,
  '\b(?:ie[\t\f\r ])': 'that is' # ie followed by whitespaces but not newline
  '\b(?:ie(?=[;,]))': 'that is' # ie followed by ; or ,

and set ignorecase: false which makes all test pass. should i update the PR's suggestions with my findings?

alexvonme commented 5 months ago

i replaced all the complex i.e rules with
  '\b(?:i\.\s?e\.)': 'that is' # i.e. with possible whitespaces inbetween
  '\b(?:ie\.\s+)': 'that is' # ie. followed by whitespaces
  '\b(?:ie\.(?=[;,]))': 'that is' # ie. followed by ; or ,
  '\b(?:ie[\t\f\r ])': 'that is' # ie followed by whitespaces but not newline
  '\b(?:ie(?=[;,]))': 'that is' # ie followed by ; or ,
and set ignorecase: false which makes all test pass. should i update the PR's suggestions with my findings?

We can't set ignorecase to false. A lot of people literally start their sentences with IE.

alexvonme commented 5 months ago

We can't set ignorecase to false. A lot of people literally start their sentences with IE.

If the reason you set ignorecase to false is to except IE for Internet Explorer, I really don't think it's necessary. I have not in the last decade seen anyone refer to it as IE. It's not used anymore, so people don't use its abbreviation.

ccoVeille commented 5 months ago

We can't set ignorecase to false. A lot of people literally start their sentences with IE.

maybe, you could use ignorecase:false and define all the variations you need, but exclude IE capital

ignorecase: false
map:
  '\b(?:[iI]\.\s?[eE]\.)': 'that is' # i.e. with possible whitespaces inbetween
  '\b(?:[iI][eE]\.\s+)': 'that is' # ie. followed by whitespaces
  '\b(?:[iI][eE]\.(?=[;,]))': 'that is' # ie. followed by ; or ,
  '\b(?:ie[\t\f\r ])': 'that is' # ie followed by whitespaces but not newline, here IE is excluded
  '\b(?:[iI][eE](?=[;,]))': 'that is' # ie followed by ; or ,

Regexp are so fun to maintain

Otherwise, I'm OK with @tbazant keep ignorecase:true, and people who used IE would be unhappy :smile: Users can use vale comment to disable the rule that trying to maintain a complicated regex

It's open for debate

tbazant commented 5 months ago

I made last modification, enhanced tests a bit, review please and tell me whether to merge, it's getting long :-)

alexvonme commented 5 months ago

I fully agree with your need to finally be done with this one. Though, I think I made the last change. If you're okay with it, please merge.

openSUSE / suse-vale-styleguide

Fix Spacing in Latin.yml #103