Closed pjfarleyiii closed 4 years ago
While continuing my research on this issue I added a second data row to the test csv file that does have the optional ONLINE word in the Name value, and that row is correctly handled, details below.
So the question is, when the 4th pattern group is NOT present, why is the fifth group also set to the null string?
Peter
C:\Users\Username\Project>cat testbug.csv
Name,Number,Memo
CITY WATER BOARD Jun 27 05:16 9999 Reference# 13834,,
ONLINE - TRANSFER TO MORTGAGE Jun 01 08:18 9999 ONLINE Reference# 3802,,
C:\Users\Username\Project>mlr --csv put -f testbug.mlr testbug.csv
$Name=CITY WATER BOARD Jun 27 05:16 9999 Reference# 13834
Found Mth date/time Reference# append, pt1=CITY WATER BOARD ,pt2=Jun 27 05:16 9999 Reference#,pt3=Jun,pt4=,pt5=,pt6=,pt7=,pt8=,pt9=
Name,Number,Memo
CITY WATER BOARD,,Jun 27 05:16 9999 Reference#
$Name=ONLINE - TRANSFER TO MORTGAGE Jun 01 08:18 9999 ONLINE Reference# 3802
Found Mth date/time Reference# append, pt1=ONLINE - TRANSFER TO MORTGAGE ,pt2=Jun 01 08:18 9999 ONLINE Reference#,pt3=Jun,pt4=ONLINE ,pt5=3802,pt6=,pt7=,pt8=,pt9=
ONLINE - TRANSFER TO MORTGAGE,3802,Jun 01 08:18 9999 ONLINE Reference# 3802
I think there is a serious functional issue with the regex optional pattern code. Below is a very simple example that demonstrates the issue.
Whenever an optional regex pattern group is coded. like (Bill )? in the example, if the data field does NOT contain the optional component then none of the pattern groups following the optional regex pattern group are captured.
Please investigate this issue ASAP. I have multiple needs for using this regex pattern to regularize my data and I was hoping to use miller to do that work for me, but I really cannot continue without this pattern functioning correctly,
Thank you for such a powerful and easy to use tool, but please try to resolve this issue quickly.
Peter
C:\Users\Username\Project>cat testbug2.csv
Name,Number,Memo
Payment NOWHERE INSURANCE,,JUN 20 08:32p
Bill Payment CITY WATER CO,,SEP 9 09:10p
C:\Users\Peter\Downloads\Citibank>cat testbug2.mlr
print "$Name=" . $Name;
$Name =~ "^(Bill )?Payment (.+)$" {
print "Found Payment, pt1=\1,pt2=\2,pt3=\3,pt4=\4,pt5=\5" .
",pt6=\6,pt7=\7,pt8=\8,pt9=\9";
$Name = "\2";
$Memo = clean_whitespace("\1Payment " . $Memo)
}
C:\Users\Username\Project>mlr --csv put -f testbug2.mlr testbug2.csv
$Name=Payment NOWHERE INSURANCE
Found Payment, pt1=,pt2=,pt3=,pt4=,pt5=,pt6=,pt7=,pt8=,pt9=
Name,Number,Memo
,,Payment JUN 20 08:32p
$Name=Bill Payment CITY WATER CO
Found Payment, pt1=Bill ,pt2=CITY WATER CO,pt3=,pt4=,pt5=,pt6=,pt7=,pt8=,pt9=
CITY WATER CO,,Bill Payment SEP 9 09:10p
Thank you -- I will investigate.
I have multiple business travel commitments this month as well as highly non-trivial RL at present, so I don't know what I can promise you in regard to turnaround time on this open-source project.
Thank you for promising to investigate. I do understand how RL commitments affect our voluntary online lives. I have a workaround for the present (two different pattern matches) for the issues I have encountered so far. Depending on your timeline for finding and resolving this miller issue I may have to fall back on intermediate gawk scripts for the interim, but that is no bad thing per se, it is one of my comfort zones linguistically.
I will monitor for your findings on this miller issue when you get time to deal with them.
Peter
OK @pjfarleyiii ... underlying issue is I didn't know about optional regex captures when I wrote this way back. (Just plugged into the C regex library.)
Test script (note $HOME/bin/mlr
is old, ./mlr
is new):
#!/bin/bash
echo ================================================================
echo -n 'sed: '; echo 'abcdefg' | sed -E 's/ab(.)d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(.)d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abcdefg' | ./mlr --nidx put '$1 = sub($1, "ab(.)d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(.)d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abcdefg' | ./mlr --inidx --odkvp put '$1 =~ "ab(.)d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo
echo ================================================================
echo -n 'sed: '; echo 'abcdefg' | sed -E 's/ab(c)?d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abcdefg' | ./mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(.)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abcdefg' | ./mlr --inidx --odkvp put '$1 =~ "ab(.)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo
echo ================================================================
echo -n 'sed: '; echo 'abXdefg' | sed -E 's/ab(c)?d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abXdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abXdefg' | ./mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abXdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abXdefg' | ./mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo
echo ================================================================
echo -n 'sed: '; echo 'abdefg' | sed -E 's/ab(c)?d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abdefg' | ./mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abdefg' | ./mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo
echo ================================================================
Output:
$ tryre.sh
================================================================
sed: ab<<c>>d<<ef>>g
old: ab<<c>>d<<ef>>g
new: ab<<c>>d<<ef>>g
old: 1=abcdefg,pt1=c,pt2=ef
new: 1=abcdefg,pt1=c,pt2=ef
================================================================
sed: ab<<c>>d<<ef>>g
old: ab<<c>>d<<ef>>g
new: ab<<c>>d<<ef>>g
old: 1=abcdefg,pt1=c,pt2=ef
new: 1=abcdefg,pt1=c,pt2=ef
================================================================
sed: abXdefg
old: abXdefg
new: abXdefg
old: 1=abXdefg
new: 1=abXdefg
================================================================
sed: ab<<>>d<<ef>>g
old: ab<<\1>>d<<ef>>g
new: ab<<>>d<<ef>>g
old: 1=abdefg,pt1=,pt2=
new: 1=abdefg,pt1=,pt2=ef
================================================================
This will go into head soon.
Also:
$ cat testbug2.csv
Name,Number,Memo
Payment NOWHERE INSURANCE,,JUN 20 08:32p
Bill Payment CITY WATER CO,,SEP 9 09:10p
$ cat testbug2.mlr
$Name =~ "^(Bill )?Payment (.+)$" {
$pt1 = "\1";
$pt2 = "\2";
}
$ cat testbug2.sh
mlr --c2j put -f testbug2.mlr testbug2.csv
$ sh testbug2.sh | jq .
{
"Name": "Payment NOWHERE INSURANCE",
"Number": "",
"Memo": "JUN 20 08:32p",
"pt1": "",
"pt2": "NOWHERE INSURANCE"
}
{
"Name": "Bill Payment CITY WATER CO",
"Number": "",
"Memo": "SEP 9 09:10p",
"pt1": "Bill ",
"pt2": "CITY WATER CO"
}
Fixed in commit https://github.com/johnkerl/miller/commit/ab3407256dabdb6449594c3f9f1c5f89a09d17b3
Thank you for reporting this!!
You are welcome. I would love to test the commit, but I have not been building miller from source prior to this point. I DL'd your 5.6.2 pre-built version for linux and WIndows and have been using them for my testing.
I suspect from the link you provided that I would need to git clone the source including this commit and build from source to test it, is that correct?
[Edit] How can I run the source build? I DL'ed the source as a zip file and I do not see a configure script executable. Are you not using GNU autotools for this build? If so I would appreciate instructions on starting a new build.
[Edit] And I found the build documentation online and will proceed to try a build under WIn10 WSL Debian 9.
More as it happens.
Peter
Built miller from source OK using GNU autoconfig tools in a WIn10 WSL Debian 9 environment per the docs. There were some differences in make check after the build, which I ignored for now. I can send you the make check log if you would like to review it.
This commit fixes the bug I reported and processes optional regexp groups correctly from my understanding and as gawk currently processes them.
I was also able to DL the latest Windows dev build version from Appveyor and it tested the same as the linux version I just built in a Win10 CMD.EXE environment. The Appveyor version passed both of the bug examples I submitted.
I consider this issue resolved. Thank you very much for the quick turnaround!
I will monitor for your next point release so I can remove the build environment and continue with a stable version, but as of tonight I am good to go.
Thank you again.
Peter
Sorry, I hit the wrong button. I did not intend to close this for you, that is your prerogative.
Peter
:)
I am not sure if this is a bug or my own ignorance. Please advise.
Pattern group 4 = (ONLINE )? is the null string as it should be, but pattern group 5 = ([0-9]+)$ ahould be 13834 but it is null. Why is pattern group 5 not remembered please?
The equivalent gawk program does capture regex group 5, as shown below.
Peter