johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io
Other
8.91k stars 215 forks source link

After missing optional pattern group in data field, RegEx match sets remaining pattern group results null ? #277

Closed pjfarleyiii closed 4 years ago

pjfarleyiii commented 4 years ago

I am not sure if this is a bug or my own ignorance. Please advise.

    C:\Users\Username\Project>mlr --version
    Miller 5.6.2

    C:\Users\Username\Project>cat testbug.csv
    Name,Number,Memo
    CITY WATER BOARD Jun 27 05:16 9999 Reference# 13834,,

    C:\Users\Username\Project>cat testbug.mlr
    print "$Name=" . $Name;
    $Name =~ "^(.+)((Jan|Feb|Mar|Apr|May|Jun[e]?|Jul[y]?|Aug|Sep[t]?|Oct|Nov|Dec) [0-9]+ [0-9:]+ [0-9]+ (ONLINE )?Reference.) ([0-9]+)$" {
        print "Found Mth date/time Reference# append, pt1=\1,pt2=\2,pt3=\3,pt4=\4,pt5=\5" .
              ",pt6=\6,pt7=\7,pt8=\8,pt9=\9";
        $Name = rstrip("\1");
        $Number = "\5";
        $Memo = clean_whitespace($Memo . " \2 \5")
    }

    C:\Users\Username\Project>mlr --csv put -f testbug.mlr testbug.csv
    $Name=CITY WATER BOARD Jun 27 05:16 8699 Reference# 13834
    Found Mth date/time Reference# append, pt1=NYC WATER BOARD ,pt2=Jun 27 05:16 9999 Reference#,pt3=Jun,pt4=,pt5=,pt6=,pt7=,pt8=,pt9=
    Name,Number,Memo
    NYC WATER BOARD,,Jun 27 05:16 9999 Reference#

Pattern group 4 = (ONLINE )? is the null string as it should be, but pattern group 5 = ([0-9]+)$ ahould be 13834 but it is null. Why is pattern group 5 not remembered please?

The equivalent gawk program does capture regex group 5, as shown below.

    C:\Users\Username\Project>gawk --version
    GNU Awk 5.0.1, API: 2.0 (GNU MPFR 3.1.5, GNU MP 6.1.2)

    C:\Users\Username\Project>cat testbug.awk
    BEGIN {
        FS = ","
        OFS = ","
    }
    function Squeeze(src) { gsub(/[ ]+/, " ", src); return src }
    function Trim(src)  {
        return gensub(/[ \t]+$/, "", 1, gensub(/^[ \t]+/, "", 1, src))
    }
    NR > 1 { print "$1=" $1 }
    $1 ~ /^(.+)((Jan|Feb|Mar|Apr|May|Jun[e]?|Jul[y]?|Aug|Sep[t]?|Oct|Nov|Dec) [0-9]+ [0-9:]+ [0-9]+ (ONLINE )?Reference.) ([0-9]+)$/ {
        n = match($1, /^(.+)((Jan|Feb|Mar|Apr|May|Jun[e]?|Jul[y]?|Aug|Sep[t]?|Oct|Nov|Dec) [0-9]+ [0-9:]+ [0-9]+ (ONLINE )?Reference.) ([0-9]+)$/, pt)
        print "Found Mth date/time Reference# append, n=" n ",pt1=" pt[1] ",pt2=" pt[2] ",pt3=" pt[3] ",pt4=" pt[4] ",pt5=" pt[5] \
              ",pt6=" pt[6] ",pt7=" pt[7] ",pt8=" pt[8] ",pt9=" pt[9]
        $1 = Trim(pt[1]);
        $2 = pt[5];
        $3 = Squeeze(Trim($3 " " pt[2] " " pt[5]))
        $0 = $0
        print $0
    }

    C:\Users\Username\Project>gawk -f testbug.awk testbug.csv
    $1=CITY WATER BOARD Jun 27 05:16 9999 Reference# 13834
    Found Mth date/time Reference# append, n=1,pt1=CITY WATER BOARD ,pt2=Jun 27 05:16 9999 Reference#,pt3=Jun,pt4=,pt5=13834,pt6=,pt7=,pt8=,pt9=
    CITY WATER BOARD,13834,Jun 27 05:16 9999 Reference# 13834

Peter

pjfarleyiii commented 4 years ago

While continuing my research on this issue I added a second data row to the test csv file that does have the optional ONLINE word in the Name value, and that row is correctly handled, details below.

So the question is, when the 4th pattern group is NOT present, why is the fifth group also set to the null string?

Peter

    C:\Users\Username\Project>cat testbug.csv
    Name,Number,Memo
    CITY WATER BOARD Jun 27 05:16 9999 Reference# 13834,,
    ONLINE - TRANSFER TO MORTGAGE Jun 01 08:18 9999 ONLINE Reference# 3802,,

    C:\Users\Username\Project>mlr --csv put -f testbug.mlr testbug.csv
    $Name=CITY WATER BOARD Jun 27 05:16 9999 Reference# 13834
    Found Mth date/time Reference# append, pt1=CITY WATER BOARD ,pt2=Jun 27 05:16 9999 Reference#,pt3=Jun,pt4=,pt5=,pt6=,pt7=,pt8=,pt9=
    Name,Number,Memo
    CITY WATER BOARD,,Jun 27 05:16 9999 Reference#
    $Name=ONLINE - TRANSFER TO MORTGAGE Jun 01 08:18 9999 ONLINE Reference# 3802
    Found Mth date/time Reference# append, pt1=ONLINE - TRANSFER TO MORTGAGE ,pt2=Jun 01 08:18 9999 ONLINE Reference#,pt3=Jun,pt4=ONLINE ,pt5=3802,pt6=,pt7=,pt8=,pt9=
    ONLINE - TRANSFER TO MORTGAGE,3802,Jun 01 08:18 9999 ONLINE Reference# 3802
pjfarleyiii commented 4 years ago

I think there is a serious functional issue with the regex optional pattern code. Below is a very simple example that demonstrates the issue.

Whenever an optional regex pattern group is coded. like (Bill )? in the example, if the data field does NOT contain the optional component then none of the pattern groups following the optional regex pattern group are captured.

Please investigate this issue ASAP. I have multiple needs for using this regex pattern to regularize my data and I was hoping to use miller to do that work for me, but I really cannot continue without this pattern functioning correctly,

Thank you for such a powerful and easy to use tool, but please try to resolve this issue quickly.

Peter

    C:\Users\Username\Project>cat testbug2.csv
    Name,Number,Memo
    Payment NOWHERE INSURANCE,,JUN 20 08:32p
    Bill Payment CITY WATER CO,,SEP 9 09:10p

    C:\Users\Peter\Downloads\Citibank>cat testbug2.mlr
    print "$Name=" . $Name;
    $Name =~ "^(Bill )?Payment (.+)$" {
        print "Found Payment, pt1=\1,pt2=\2,pt3=\3,pt4=\4,pt5=\5" .
              ",pt6=\6,pt7=\7,pt8=\8,pt9=\9";
        $Name = "\2";
        $Memo = clean_whitespace("\1Payment " . $Memo)
    }

    C:\Users\Username\Project>mlr --csv put -f testbug2.mlr testbug2.csv
    $Name=Payment NOWHERE INSURANCE
    Found Payment, pt1=,pt2=,pt3=,pt4=,pt5=,pt6=,pt7=,pt8=,pt9=
    Name,Number,Memo
    ,,Payment JUN 20 08:32p
    $Name=Bill Payment CITY WATER CO
    Found Payment, pt1=Bill ,pt2=CITY WATER CO,pt3=,pt4=,pt5=,pt6=,pt7=,pt8=,pt9=
    CITY WATER CO,,Bill Payment SEP 9 09:10p
johnkerl commented 4 years ago

Thank you -- I will investigate.

I have multiple business travel commitments this month as well as highly non-trivial RL at present, so I don't know what I can promise you in regard to turnaround time on this open-source project.

pjfarleyiii commented 4 years ago

Thank you for promising to investigate. I do understand how RL commitments affect our voluntary online lives. I have a workaround for the present (two different pattern matches) for the issues I have encountered so far. Depending on your timeline for finding and resolving this miller issue I may have to fall back on intermediate gawk scripts for the interim, but that is no bad thing per se, it is one of my comfort zones linguistically.

I will monitor for your findings on this miller issue when you get time to deal with them.

Peter

johnkerl commented 4 years ago

OK @pjfarleyiii ... underlying issue is I didn't know about optional regex captures when I wrote this way back. (Just plugged into the C regex library.)

Test script (note $HOME/bin/mlr is old, ./mlr is new):

#!/bin/bash

echo ================================================================
echo -n 'sed: '; echo 'abcdefg' | sed -E 's/ab(.)d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(.)d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abcdefg' |         ./mlr --nidx put '$1 = sub($1, "ab(.)d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(.)d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abcdefg' |         ./mlr --inidx --odkvp put '$1 =~ "ab(.)d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo

echo ================================================================
echo -n 'sed: '; echo 'abcdefg' | sed -E 's/ab(c)?d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abcdefg' |         ./mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abcdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(.)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abcdefg' |         ./mlr --inidx --odkvp put '$1 =~ "ab(.)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo

echo ================================================================
echo -n 'sed: '; echo 'abXdefg' | sed -E 's/ab(c)?d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abXdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abXdefg' |         ./mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abXdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abXdefg' |         ./mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo

echo ================================================================
echo -n 'sed: '; echo 'abdefg' | sed -E 's/ab(c)?d(..)g/ab<<\1>>d<<\2>>g/'
echo -n 'old: '; echo 'abdefg' | $HOME/bin/mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'new: '; echo 'abdefg' |         ./mlr --nidx put '$1 = sub($1, "ab(c)?d(..)g", "ab<<\1>>d<<\2>>g")'
echo -n 'old: '; echo 'abdefg' | $HOME/bin/mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo -n 'new: '; echo 'abdefg' |         ./mlr --inidx --odkvp put '$1 =~ "ab(c)?d(..)g" { $pt1 = "\1"; $pt2 = "\2"}'
echo

echo ================================================================

Output:

$ tryre.sh
================================================================
sed: ab<<c>>d<<ef>>g
old: ab<<c>>d<<ef>>g
new: ab<<c>>d<<ef>>g
old: 1=abcdefg,pt1=c,pt2=ef
new: 1=abcdefg,pt1=c,pt2=ef

================================================================
sed: ab<<c>>d<<ef>>g
old: ab<<c>>d<<ef>>g
new: ab<<c>>d<<ef>>g
old: 1=abcdefg,pt1=c,pt2=ef
new: 1=abcdefg,pt1=c,pt2=ef

================================================================
sed: abXdefg
old: abXdefg
new: abXdefg
old: 1=abXdefg
new: 1=abXdefg

================================================================
sed: ab<<>>d<<ef>>g
old: ab<<\1>>d<<ef>>g
new: ab<<>>d<<ef>>g
old: 1=abdefg,pt1=,pt2=
new: 1=abdefg,pt1=,pt2=ef

================================================================

This will go into head soon.

johnkerl commented 4 years ago

Also:

$ cat testbug2.csv
Name,Number,Memo
Payment NOWHERE INSURANCE,,JUN 20 08:32p
Bill Payment CITY WATER CO,,SEP 9 09:10p

$ cat testbug2.mlr
$Name =~ "^(Bill )?Payment (.+)$" {
    $pt1 = "\1";
    $pt2 = "\2";
}

$ cat testbug2.sh
mlr --c2j put -f testbug2.mlr testbug2.csv

$ sh testbug2.sh | jq .
{
  "Name": "Payment NOWHERE INSURANCE",
  "Number": "",
  "Memo": "JUN 20 08:32p",
  "pt1": "",
  "pt2": "NOWHERE INSURANCE"
}
{
  "Name": "Bill Payment CITY WATER CO",
  "Number": "",
  "Memo": "SEP 9 09:10p",
  "pt1": "Bill ",
  "pt2": "CITY WATER CO"
}
johnkerl commented 4 years ago

Fixed in commit https://github.com/johnkerl/miller/commit/ab3407256dabdb6449594c3f9f1c5f89a09d17b3

Thank you for reporting this!!

pjfarleyiii commented 4 years ago

You are welcome. I would love to test the commit, but I have not been building miller from source prior to this point. I DL'd your 5.6.2 pre-built version for linux and WIndows and have been using them for my testing.

I suspect from the link you provided that I would need to git clone the source including this commit and build from source to test it, is that correct?

[Edit] How can I run the source build? I DL'ed the source as a zip file and I do not see a configure script executable. Are you not using GNU autotools for this build? If so I would appreciate instructions on starting a new build.

[Edit] And I found the build documentation online and will proceed to try a build under WIn10 WSL Debian 9.

More as it happens.

Peter

pjfarleyiii commented 4 years ago

Built miller from source OK using GNU autoconfig tools in a WIn10 WSL Debian 9 environment per the docs. There were some differences in make check after the build, which I ignored for now. I can send you the make check log if you would like to review it.

This commit fixes the bug I reported and processes optional regexp groups correctly from my understanding and as gawk currently processes them.

I was also able to DL the latest Windows dev build version from Appveyor and it tested the same as the linux version I just built in a Win10 CMD.EXE environment. The Appveyor version passed both of the bug examples I submitted.

I consider this issue resolved. Thank you very much for the quick turnaround!

I will monitor for your next point release so I can remove the build environment and continue with a stable version, but as of tonight I am good to go.

Thank you again.

Peter

pjfarleyiii commented 4 years ago

Sorry, I hit the wrong button. I did not intend to close this for you, that is your prerogative.

Peter

johnkerl commented 4 years ago

:)