usnistgov / ccu_validation_scoring

Other
5 stars 0 forks source link

Potential bug in score_norm_emotion.py #9

Open islam-nassar opened 1 year ago

islam-nassar commented 1 year ago

In L424 , the sum_scaled_tp and sum_scaled_fp are calculated based on the fhyp which is a filtered version of ihyp.. since you do not accumulate scaled_tp and scaled_fp as you did with tp and fp in L407, these scaled values will be underestimated. A proposed fix is to use ihyp instead of fhyp here or alternatively accumulate scaled_tp and scaled_fp as well.

jfiscus commented 1 year ago

This is not a bug. (I was tripped by this during implementation.) ihyp is the full record of all the ref->system alignments but the loop on L377 generates scores for multiple correctness constraints (in iou_thresholds) by filtering ihyp AND (this is the messy part) the fhyp filter picks the single, last record if records share the same LLR. (This minimizes the number of points in a PR curve). Since cum_tp and cum_fp are made before the filter, all PR curve inflection points are preserved.

islam-nassar commented 1 year ago

Thanks for your reply.

I understand the logic for cum_tp and cum_fp: since you apply cumulative sum before filtering ihyp (L407), you end up with the total tp and fp in the last record so they are preserved after the fhyp filter..

However, the same is not true for the scaled versions (pct_tp and pct_fp) because there is no cumulative sum for them. So pct_tp and pct_fp values for any records which share the same LLR would be lost after the fhyp filter (except for the last one).

Is my understanding correct? or totally wrong?