makamaka / Text-CSV

comma-separated values manipulator
24 stars 19 forks source link

is_missing() working properly? #27

Closed rwhitworth closed 7 years ago

rwhitworth commented 9 years ago

I notice there are no tests for the is_missing() function, so I created a few, but it looks like is_missing returns undef when it shouldn't. Am I misunderstanding the is_missing or getline_hr functions?

The test is also my fork of this repo in case Markdown eats the code..

!/usr/bin/perl

use strict; $^W = 1;

use Test::More tests => 9;

BEGIN { $ENV{PERL_TEXT_CSV} = 0; use_ok "Text::CSV", (); plan skip_all => "Cannot load Text::CSV" if $@; }

open FH, ">_99test.csv"; print FH <<EOC;

2 EOC close FH;

ok (my $csv = Text::CSV->new (), "new"); is ($csv->is_missing(0), undef, "is_missing()");

open FH, "<_99test.csv"; ok ($csv->column_names ('code')); ok (my $hr = $csv->getline_hr (_FH)); is ($csv->is_missing(0), undef, "is_missing()"); ok ($hr = $csv->getline_hr (_FH)); is (int $hr->{code}, 2, "code==2"); isnt ($csv->is_missing(0), undef, "isn't is_missing()"); close FH;

Tux commented 8 years ago

At least in Text::CSV_XS this is not a bug: when using getline_hr and column_names are set, none of the columns can be missing in a _hr call, so is_missing is forced to be false for every column. This is documented:

   is_missing
        my $missing = $csv->is_missing ($column_idx);

       Where  $column_idx is the  (zero-based)  index of the column in the
       last result of "getline_hr".

        $csv->keep_meta_info (1);
        while (my $hr = $csv->getline_hr ($fh)) {
            $csv->is_missing (0) and next; # This was an empty line
            }

       When using  "getline_hr",  it is impossible to tell if the  parsed
       fields are "undef" because they where not filled in the "CSV" stream
       or because they were not read at all, as all the fields defined by
       "column_names" are set in the hash-ref.    If you still need to know if
       all fields in each row are provided, you should enable "keep_meta_info"
       so you can check the flags.

I admit that the docs can be more clear about that. Suggestions welcome.

charsbar commented 8 years ago

@Tux, is that true when we set "keep_meta_info" to true as I did in https://rt.cpan.org/Public/Bug/Display.html?id=117495 ?

I suspect the following line that sets 0x0010 (CSV_FLAGS_MIS) does not work correctly when $#{$fr} is 0. (ie. because we got nothing, shouldn't we mark everything as CSV_FLAGS_MIS, instead of from 1 to $#_COLUMN_NAMES ? )

$self->{_FFLAGS}[$_] = 0x0010 for ($#{$fr} + 1) .. $#{$self->{_COLUMN_NAMES}};

(https://github.com/Tux/Text-CSV_XS/blob/master/CSV_XS.pm#L852 ; Same can be said for https://github.com/makamaka/Text-CSV/blob/master/lib/Text/CSV_PP.pm#L814 )

Tux commented 8 years ago

I think you are right. Checking …

Uh, no: if you have an empty line, there is still ONE field: an empty field. That field is not missing.

So, if you set just one single column name, that will never be missing, unless you have a parse error on the first field. This however is most likely the most correct code:

@@ -849,7 +849,7 @@ sub getline_hr {
     $self->{_COLUMN_NAMES} or croak ($self->SetDiag (3002));
     my $fr = $self->getline (@args) or return;
     if (ref $self->{_FFLAGS}) {
-       $self->{_FFLAGS}[$_] = 0x0010 for ($#{$fr} + 1) .. $#{$self->{_COLUMN_NAMES}};
+       $self->{_FFLAGS}[$_] = 0x0010 for (@$fr ? $#{$fr} + 1 : 0) .. $#{$self->{_COLUMN_NAMES}};
        }
     @hr{@{$self->{_COLUMN_NAMES}}} = @$fr;
     \%hr;
charsbar commented 8 years ago

@Tux, yeah, your patch in the above message seems most reasonable to me, especially because the example code in the pod has been saying you can skip an empty line if is_missing(0) returns true.

$csv->keep_meta_info (1);
while (my $hr = $csv->getline_hr ($fh)) {
    $csv->is_missing (0) and next; # This was an empty line
    }

If we consider there is always at least one field, that is_missing(0) returns always false and the line will never be skipped, contrary to what's said there.

Tux commented 8 years ago

Thanks for pointing there. That example should be changed. I'll reconsider the docs and the current behavior. Maybe a complete empty line should set missing, even if it is legal.

rwhitworth commented 8 years ago

Any thoughts on if this would be a bug when Text::CSV_PP is in use?

Tux commented 8 years ago

@rwhitworth your assumption was buggy in the first place (even though the docs are vague) An empty line is legal and thus returns a single empty field As the current behavior is the same in both _PP and _XS, I cannot guarantee that if I change the behavor in _XS it will also be changed in _PP (which I do not maintain) I thank you for pointing at the vagueness though! Anything that can be improved should be improved (eventually)

I think that detection an empty line using getline_hr would take some serious extra steps from a user point of view, something like

# for CSV, CSV_PP and CSV_XS
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, keep_meta_info => 1, blank_is_undef => 1 });
$csv->column_names ("code", "desc");
while (my $row = $csv->getline_hr ($fh)) {
    $csv->is_missing (1) && !defined $row->{code} and say "EMPTY!";
    }

if you have just one single column, checking for definedness with blank_is_undef should be sufficient

charsbar commented 8 years ago

@Tux, the above code is hard to write, and actually doesn't work (says "EMPTY!" for the first two lines) if the csv looks like the following, and empty_is_undef is set to true:

""
(empty line)
1,2
3

I agree an empty line is legal in general, but I'd prefer to keep the doc and change the behavior.

@rwhitworth, your test code you've sent us as a PR has a few issues: 1) it doesn't set keep_meta_info to true, so is_missing() doesn't work as you might expect (this has been documented, though _PP's doc is vague; I'd update it when this settles); 2) your assumption that is_missing() returns either undef or non-undef is not correct. Actually it returns undef only if $csv object hasn't called getline and its friends, and it returns 0 or 1 afterwards (as of this writing).

rwhitworth commented 8 years ago

I don't fully follow, but I do appreciate the time you both (@Tux and @charsbar) have spent. Feel free to close this issue and/or my PR when it is appropriate.

charsbar commented 8 years ago

@rwhitworth , OK, thanks for your kindness. But please don't hesitate to say which or how you prefer to write when you parse your csv. Like or unlike is an important input for both of us! :)

Tux commented 8 years ago

I still have not made up my mind. So hard to choose between multiple versions of correct :)

Tux commented 7 years ago

@rwhitworth I have now committed this: https://github.com/Tux/Text-CSV_XS/commit/6b71f3937ff8a73b26618cf11cf142f805edf357

   is_missing
        my $missing = $csv->is_missing ($column_idx);

       Where  $column_idx is the  (zero-based)  index of the column in the
       last result of "getline_hr".

        $csv->keep_meta_info (1);
        while (my $hr = $csv->getline_hr ($fh)) {
            $csv->is_missing (0) and next; # This was an empty line
            }

       When using  "getline_hr",  it is impossible to tell if the  parsed
       fields are "undef" because they where not filled in the "CSV" stream
       or because they were not read at all, as all the fields defined by
       "column_names" are set in the hash-ref.    If you still need to know if
       all fields in each row are provided, you should enable "keep_meta_info"
       so you can check the flags.

       If  "keep_meta_info"  is "false",  "is_missing"  will always return
       "undef", regardless of $column_idx being valid or not. If this
       attribute is "true" it will return either 0 (the field is present) or 1
       (the field is missing).

       A special case is the empty line.  If the line is completely empty -
       after dealing with the flags - this is still a valid CSV line:  it is a
       record of just one single empty field. However, if "keep_meta_info" is
       set, invoking "is_missing" with index 0 will now return true.

And added this to the test suite:

open $fh, ">", $tfn or die "$tfn: $!\n";
print $fh <<"EOC";
a,b

2
EOC
close $fh;

ok ($csv = Text::CSV_XS->new (), "new");

open $fh, "<", $tfn or die "$tfn: $!\n";
ok ($csv->column_names ("code", "foo"), "set column names");
ok ($hr = $csv->getline_hr ($fh), "get header line");
is ($csv->is_missing (0), undef, "not is_missing () - no meta");
is ($csv->is_missing (1), undef, "not is_missing () - no meta");
ok ($hr = $csv->getline_hr ($fh), "get empty line");
is ($csv->is_missing (0), undef, "not is_missing () - no meta");
is ($csv->is_missing (1), undef, "not is_missing () - no meta");
ok ($hr = $csv->getline_hr ($fh), "get partial data line");
is (int $hr->{code}, 2, "code == 2");
is ($csv->is_missing (0), undef, "not is_missing () - no meta");
is ($csv->is_missing (1), undef, "not is_missing () - no meta");
close $fh;

open $fh, "<", $tfn or die "$tfn: $!\n";
$csv->keep_meta_info (1);
ok ($csv->column_names ("code", "foo"), "set column names");
ok ($hr = $csv->getline_hr ($fh), "get header line");
is ($csv->is_missing (0), 0, "not is_missing () - with meta");
is ($csv->is_missing (1), 0, "not is_missing () - with meta");
ok ($hr = $csv->getline_hr ($fh), "get empty line");
is ($csv->is_missing (0), 1, "not is_missing () - with meta");
is ($csv->is_missing (1), 1, "not is_missing () - with meta");
ok ($hr = $csv->getline_hr ($fh), "get partial data line");
is (int $hr->{code}, 2, "code == 2");
is ($csv->is_missing (0), 0, "not is_missing () - with meta");
is ($csv->is_missing (1), 1, "not is_missing () - with meta");
close $fh;
charsbar commented 7 years ago

And Text::CSV_PP passes the test above now, with the above commit. Thanks.