Evaluation Measures: Understanding of macro average

Micro P, R, F1

P, R, F1 on entity level (not on token level): micro average (= over all documents)
- strict and fuzzy (= at least 1 token overlap)
- separately per type and cumulative for all types
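A minimal sketch of the entity-level micro evaluation described above, with strict and fuzzy matching, reported per type and cumulatively. The `Entity` tuple, the exclusive end offset, the greedy one-to-one matching, and the choice to score 0.0 when a denominator is empty are assumptions made for illustration, not the scorer's actual implementation.

```python
from collections import namedtuple

# Hypothetical entity representation: token offsets, end is exclusive.
Entity = namedtuple("Entity", "doc type start end")

def matches(gold, pred, fuzzy=False):
    """Strict: same type and identical boundaries. Fuzzy: same type and >= 1 shared token."""
    if gold.doc != pred.doc or gold.type != pred.type:
        return False
    if fuzzy:
        return gold.start < pred.end and pred.start < gold.end
    return (gold.start, gold.end) == (pred.start, pred.end)

def count_tp_fp_fn(gold, pred, fuzzy=False):
    """Greedy one-to-one matching; counts are pooled over all documents (micro)."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        hit = next((g for g in unmatched if matches(g, p, fuzzy)), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    return tp, len(pred) - tp, len(unmatched)

def prf(tp, fp, fn):
    """Precision, recall, F1; 0.0 whenever a denominator is empty (an assumption)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate(gold, pred, fuzzy=False):
    """Micro P, R, F1 separately per type, plus cumulative over all types ("ALL")."""
    types = sorted({e.type for e in gold + pred})
    result = {
        t: prf(*count_tp_fp_fn([g for g in gold if g.type == t],
                               [p for p in pred if p.type == t], fuzzy))
        for t in types
    }
    result["ALL"] = prf(*count_tp_fp_fn(gold, pred, fuzzy))
    return result

# Made-up toy data, for illustration only.
gold = [Entity("d1", "PER", 0, 2), Entity("d1", "LOC", 5, 6), Entity("d2", "PER", 3, 5)]
pred = [Entity("d1", "PER", 0, 1), Entity("d1", "LOC", 5, 6), Entity("d2", "ORG", 3, 5)]

print(count_tp_fp_fn(gold, pred))              # strict counts: (1, 2, 2)
print(count_tp_fp_fn(gold, pred, fuzzy=True))  # fuzzy counts:  (2, 1, 1)
print(evaluate(gold, pred, fuzzy=True))
```

In the toy data, the prediction `PER 0-1` fails the strict test against gold `PER 0-2` but shares one token with it, so it counts as a true positive only in the fuzzy setting.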

Macro as document-level average of micro P, R, F1

P, R, F1 on entity level (not on token level): doc-level macro average (= average of separate micro evaluation on each document)
- strict and fuzzy (= at least 1 token overlap)
- separately per type and cumulative for all types
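To illustrate how the document-level macro average differs from the micro average, here is a small sketch that starts from per-document TP/FP/FN counts. The counts and the names `doc_macro` and `micro` are hypothetical. Taking macro F1 as the mean of the per-document F1 values (rather than the F1 of the macro P and R) and scoring 0.0 for empty denominators are assumptions, not confirmed scorer behaviour.

```python
def prf(tp, fp, fn):
    """Precision, recall, F1; 0.0 whenever a denominator is empty (an assumption)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def doc_macro(counts_per_doc):
    """Average the separate micro P, R, F1 of each document (every document weighs the same)."""
    scores = [prf(*counts) for counts in counts_per_doc.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

def micro(counts_per_doc):
    """Pool TP, FP, FN over all documents, then compute P, R, F1 once."""
    tp, fp, fn = (sum(c[i] for c in counts_per_doc.values()) for i in range(3))
    return prf(tp, fp, fn)

# Hypothetical (tp, fp, fn) counts per document, for illustration only.
counts = {"doc1": (8, 2, 2), "doc2": (1, 1, 3)}

print("doc-level macro:", doc_macro(counts))
print("micro:          ", micro(counts))
```

With these counts, the short and poorly tagged `doc2` pulls the macro scores down further than the micro scores, because each document contributes equally regardless of how many entities it contains.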

@e-maud @mromanello: The following type-oriented macro average can be computed from the output of Micro P, R, F1 (spreadsheet style). Therefore the scorer should not directly compute it (for now, at least).

Macro as average over type-specific P, R, F1 measures

P, R, F1 on entity level (not on token level): type-level macro average (= average of the separate micro evaluations of each entity type)
- strict and fuzzy (= at least 1 token overlap)
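Since the type-oriented macro average is meant to be derived from the per-type micro output, a sketch of that post-processing step could look as follows. The row layout, the type labels, and the score values are made up for illustration and do not reflect the scorer's actual output format.

```python
from statistics import mean

def type_macro(per_type_scores):
    """Unweighted mean of the type-specific P, R, F1 rows."""
    return {
        measure: mean(row[measure] for row in per_type_scores.values())
        for measure in ("P", "R", "F1")
    }

# Hypothetical per-type micro scores, as they might be read from a spreadsheet export.
per_type = {
    "pers": {"P": 0.8, "R": 0.8, "F1": 0.80},
    "loc":  {"P": 0.6, "R": 0.4, "F1": 0.48},
    "org":  {"P": 0.4, "R": 0.6, "F1": 0.48},
}

print(type_macro(per_type))
```

Every type contributes equally here, so a rare type with low scores affects the result as much as a frequent one. The same function would be applied once to the strict rows and once to the fuzzy rows.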

hipe-eval / HIPE-scorer
