srvk / DiViMe

ACLEW Diarization Virtual Machine
Apache License 2.0

remove blanket eval to only use evalSAD (which should be --> evalSad) and evalDiar #85

Open alecristia opened 5 years ago

alecristia commented 5 years ago

By using the separate eval scripts, the instructions for users can be simpler.

Plus, it makes it clearer which evals can be used with which tools.

riebling commented 5 years ago

Progress on this: you can now run evalSad.sh (which now uses dscore, not ldc_sad_hmm) like this:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ evalSad.sh data/VanDam-Daylong/BN32/test noisemesSad --keep-temp
mkdir -p /vagrant/data/VanDam-Daylong/BN32/test/temp_ref
creating:  /vagrant/data/VanDam-Daylong/BN32/test/temp_ref/BN32_010007_test.lab
creating:  /vagrant/data/VanDam-Daylong/BN32/test/temp_sys/BN32_010007_test.lab
evaluating
done evaluating, check  

with results that look like this:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ cat data/VanDam-Daylong/BN32/test/noisemesSad_eval.df 
DER     B3Precision     B3Recall        B3F1    TauRefSys       TauSysRef       CE      MI      NMI
BN32_010007_test        89.43   0.9747204105    0.809801359177  0.884640273549  0.000403475776814       0.000403475776815       0.0985361241149 0.000350603912437       0.00159407640057
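
For anyone wanting to post-process these, the .df looks like a plain whitespace/tab-delimited table, so it can be loaded directly; a minimal sketch with pandas (assuming the layout above, where the header has one fewer field than the data rows, so pandas takes the leading file-ID column as the index):

    import pandas as pd

    # Sketch: load the noisemesSad_eval.df shown above. Assuming the file is
    # whitespace/tab delimited; since the header has one fewer field than the
    # data rows, pandas uses the leading file-ID column as the index.
    scores = pd.read_csv("data/VanDam-Daylong/BN32/test/noisemesSad_eval.df",
                         sep=r"\s+")
    print(scores.loc["BN32_010007_test", "DER"])  # 89.43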

But a question: this does not look like the format from before, and I'm not sure whether this result format is acceptable. In order to get this working as part of the VM self-test, changes were made to test.sh to more properly format the reference RTTM: it was using a filename (column 2) that did not match the file basename, and was missing 'speech' in column 8. It now looks more like this (a sketch of that rewrite follows the example):

SPEAKER BN32_010007_test 1 56.0 1.4 <NA> <NA> speech <NA>
SPEAKER BN32_010007_test 1 62.4 0.3 <NA> <NA> speech <NA>
SPEAKER BN32_010007_test 1 65.7 2.4 <NA> <NA> speech <NA>
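
For reference, here is a minimal sketch of that kind of rewrite (a hypothetical helper, not the actual test.sh code): force column 2 to the file basename and column 8 to 'speech':

    import os

    def fix_rttm(path):
        """Rewrite an RTTM in place so column 2 matches the file basename
        and column 8 reads 'speech' (hypothetical helper, not test.sh)."""
        base = os.path.splitext(os.path.basename(path))[0]
        fixed = []
        with open(path) as f:
            for line in f:
                cols = line.split()
                if len(cols) >= 9 and cols[0] == "SPEAKER":
                    cols[1] = base      # file ID must match the basename
                    cols[7] = "speech"  # SAD reference labels all turns 'speech'
                fixed.append(" ".join(cols))
        with open(path, "w") as f:
            f.write("\n".join(fixed) + "\n")
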
alecristia commented 5 years ago

Actually, this format relies only on RTTMs, so there is no need to create .lab files (those were needed by the script that evalSad relied on before).

However, I see something else that troubles me: the output has only DER, but not the three-way split into false alarms (FA), misses (M), and attribution errors (AE)... Could you check whether this information is produced in an intermediate step and then discarded? If so, could we please add the three columns to the evaluation output, so the header would read: FA M AE DER B3Precision B3Recall B3F1 TauRefSys TauSysRef CE MI NMI

riebling commented 5 years ago

dscore's score_batch.py only produces the columns we see, with a facility to add additional column/value pairs "by hand". Nowhere in the code's internal documentation or usage info (quoted below) do I see any mention of computing false alarms, misses, or attribution errors:

- diarization error rate (DER)
- B-cubed precision
- B-cubed recall
- B-cubed F1
- Goodman-Kruskal tau in the direction of the reference diarization to the
  system diarization (GKT(ref, sys))
- Goodman-Kruskal tau in the direction of the system diarization to the
  reference diarization (GKT(sys, ref))
- conditional entropy of the reference diarization given the system
  diarization in bits (H(ref|sys))
- mutual information in bits (MI)
- normalized mutual information (NMI)

Diarization error rate (DER) is scored using the NIST ``md-eval.pl`` tool
with a default collar size of 0 ms, ignoring regions that contain
overlapping speech in the reference RTTM. If desired, this behavior can be
altered using the ``--collar`` and ``--score_overlaps`` flags. For instance:

    python score.py --collar 0.100 --score_overlaps ref.rttm sys.rttm

would compute DER using a 100 ms collar and with overlapped speech included.

All other metrics are computed off of frame-level labelings created from the
turns in the RTTM files **WITHOUT** any use of collars. The default frame
step is 10 ms, which may be altered via the ``--step`` flag.
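
To illustrate the frame-level labeling that last paragraph describes, here is a minimal sketch (not dscore's actual code) that converts RTTM-style (onset, duration) turns into 10 ms frames:

    def turns_to_frames(turns, step=0.010):
        """Convert (onset_sec, duration_sec) turns into a 0/1 speech
        labeling at `step` resolution (sketch, not dscore's code)."""
        end = max(onset + dur for onset, dur in turns)
        frames = [0] * int(round(end / step))
        for onset, dur in turns:
            lo = int(round(onset / step))
            hi = int(round((onset + dur) / step))
            for i in range(lo, min(hi, len(frames))):
                frames[i] = 1
        return frames

    # e.g. the three reference turns shown earlier
    frames = turns_to_frames([(56.0, 1.4), (62.4, 0.3), (65.7, 2.4)])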

Looking all the way down to the lowest level, dscore/scorelib/score.py, here's the code that prints the header:

        col_names = ['DER', # Diarization error rate.
                     'B3Precision', # B-cubed precision.
                     'B3Recall', # B-cubed recall.
                     'B3F1', # B-cubed F1.
                     'TauRefSys', # Goodman-Kruskal tau ref --> sys.
                     'TauSysRef', # Goodman-Kruskal tau sys --> ref.
                     'CE', # H(ref | sys).
                     'MI', # Mutual information between ref and sys.
                     'NMI', # Normalized mutual information between ref/sys.
                    ]
        if additional_columns:
            col_names.extend(col_name for col_name, val in additional_columns)
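
For what it's worth, that additional_columns hook suggests FA, M, and AE could ride along as extra column/value pairs once we can compute them; a minimal self-contained sketch (names and values are placeholders, not dscore's actual call site):

    # Hypothetical sketch of the additional_columns hook above; the values
    # here are placeholders until FA, M, and AE can be recovered upstream.
    col_names = ['DER', 'B3Precision', 'B3Recall', 'B3F1',
                 'TauRefSys', 'TauSysRef', 'CE', 'MI', 'NMI']
    additional_columns = [('FA', 0.0), ('M', 0.0), ('AE', 0.0)]
    col_names.extend(col_name for col_name, val in additional_columns)
    print('\t'.join(col_names))

Note the hook appends at the end, so the "FA M AE DER ..." ordering requested above would need a further small change.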

Were these things (FA, M, AE) specific to ldc_sad_hmm, I'm guessing?

alecristia commented 5 years ago

DER is a composite; M, FA, and AE have to be calculated and then summed (a sketch of that composition follows the excerpt below). Looking quickly at https://github.com/srvk/dscore/blob/master/scorelib/md-eval-22.pl, check out this section (lines 2685+):

    my @rows = sort keys %{$score};
    my $miss = "miss";
    $miss .= "0" while exists $score->{$miss};
    my (@cols, %cols);
    my $min_score = $INF;
    foreach $row (@rows) {
        foreach $col (keys %{$score->{$row}}) {
            $min_score = min($min_score, $score->{$row}{$col});
            $cols{$col} = $col;
        }
    }
    @cols = sort keys %cols;
    my $fa = "fa";
    $fa .= "0" while exists $cols{$fa};
    my $reverse_search = @rows < @cols; # search is faster when ncols <= nrows
    foreach $row (@rows) {
        foreach $col (keys %{$score->{$row}}) {
            ($reverse_search ? $cost{$col}{$row} : $cost{$row}{$col})
                = $score->{$row}{$col} - $min_score;
        }
    }
    push @rows, $miss;
    push @cols, $fa;
    if ($reverse_search) {
        my @xr = @rows;
        @rows = @cols;
        @cols = @xr;
    }
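
For context on the composition: the conventional DER formula just sums those three error times over total reference speech time; a minimal sketch (textbook formula, not md-eval.pl's code):

    def der(miss, falarm, spkr_err, total_ref_speech):
        """Textbook DER composition (sketch, not md-eval.pl's code):
        missed speech + false alarm + speaker attribution error, all
        in seconds, divided by total reference speech time."""
        return 100.0 * (miss + falarm + spkr_err) / total_ref_speech

    # e.g. 3 s missed, 2 s false alarm, 4 s misattributed
    # out of 60 s of reference speech -> 15.0
    print(der(3.0, 2.0, 4.0, 60.0))
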
riebling commented 5 years ago

Oh! I somehow failed to look for code in the Perl files... this looks like where M, FA, and AE can be exposed! But it will take some hacking...