google-research-datasets / MAVE

The dataset contains 3 million attribute-value annotations across 1257 unique categories on 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, and diverse dataset for studying product attribute extraction.

Understand evaluation in the paper better #6

Open junwang-wish opened 1 year ago

junwang-wish commented 1 year ago

Hi, thanks for the repo / data / paper, great work! I am creating this issue to understand exactly how evaluation is done, since I am using an autoregressive formulation of attribute extraction: the attribute extraction is done through free-form text generation, and no attribute type is provided as input.

  1. For positive samples (product paragraphs that contain at least one attribute value), consider the following example ("target_attribute_vals" contains the annotated [attribute value](attribute type) spans that appear in "text", while "predicted_attribute_vals" contains the predictions from a model):

    {
    "text": "Nymph Womens's Chffion Polka Dot Maxi Halter Dress Extra Long",
    "target_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Length) [Halter](Neckline) Dress [Extra Long](Length)",
    "predicted_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Pattern) [Halter](Neckline) [Dress](Type) Extra Long"
    }

    The flattened tuples of the target and predicted attribute values would be (see the parsing sketch at the end of this comment):

    {
    "target_tuples": [("Pattern", "Polka Dot"), ("Length", "Maxi"), ("Neckline", "Halter"), ("Length", "Extra Long")], 
    "predicted_tuples": [("Pattern", "Polka Dot"), ("Pattern", "Maxi"), ("Neckline", "Halter"), ("Type", "Dress")]
    }
    • Then following Section 5.3 of the paper, the No value (VN), Correct values (VC), and Wrong values (VW) counts for the above would be:
      No value (VN): 2 # ("Length", "Maxi") and ("Length", "Extra Long") missing
      Correct values (VC): 2 # ("Pattern", "Polka Dot") and ("Neckline", "Halter") correct
      Wrong values (VW): 1 # ("Pattern", "Maxi") does not match ("Pattern", "Polka Dot")

      The above counts sum up to 5 attribute-value pairs, but "target_tuples" only has 4 attribute-value pairs. Is this expected?

  2. For negative samples (here I use "target_attributes_as_in_file" instead of the "target_attribute_vals" format from the positive example above, and I also add the category as given in the file):

    [
    {
    "text": "Taylor Dresses Women's High Low Lace Shirt Dress", 
    "target_attributes_as_in_file": [{'key': 'Pattern', 'evidences': []}], 
    "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress", 
    "category": "Dresses"
    }, 
    {
    "text": "Taylor Dresses Women's High Low Lace Shirt Dress Nice", 
    "target_attributes_as_in_file": [{'key': 'Neckline', 'evidences': []}, {'key': 'Pattern', 'evidences': []}, {'key': 'Type', 'evidences': []}], 
    "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress",
    "category": "Dresses"
    }
    ]

    Then following Section 5.3 of the paper,

    • No value (NN), some incorrect Value (NV) for the first sample would be
      No value (NN): 1 # No Pattern in "predicted_attribute_vals"
      some incorrect Value (NV): 0 # No Pattern in "predicted_attribute_vals", thus cannot be incorrect
    • No value (NN), some incorrect Value (NV) for the second sample (which is very similar to the first sample) would be
      No value (NN): 2 # No Neckline, Pattern in "predicted_attribute_vals"
      some incorrect Value (NV): 1 # [Dresses](Type) in "predicted_attribute_vals"

      It seems that some incorrect Value (NV) can change arbitrarily depending on "target_attributes_as_in_file", which is not consistent within a category (every sample under category="Dresses" has a different "target_attributes_as_in_file"). Is this expected?
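
For concreteness, here is a minimal Python sketch of the flattening step referenced above. The regex and the helper name flatten_annotations are assumptions about the [value](attribute) markup used in these examples, not part of any MAVE tooling.

    import re

    # Hypothetical helper: flatten a "[value](attribute)" annotated string into
    # (attribute, value) tuples, in order of appearance.
    ANNOTATION_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

    def flatten_annotations(annotated_text):
        return [(attr, value) for value, attr in ANNOTATION_RE.findall(annotated_text)]

    target = ("Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Length) "
              "[Halter](Neckline) Dress [Extra Long](Length)")
    print(flatten_annotations(target))
    # [('Pattern', 'Polka Dot'), ('Length', 'Maxi'), ('Neckline', 'Halter'), ('Length', 'Extra Long')]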

liyang2019 commented 1 year ago

Hi,

In the paper, we compute the five cases (VN, VC, VW, NV, NN) at the attribute-product pair level, instead of the attribute-product-evidence level. For an attribute-product pair, if a model provides multiple evidences, we select the top prediction by some method and check whether this top prediction belongs to the target evidence set. We count how many attribute-product pairs fall into each case, and compute precision, recall, and F1 scores based on the counts.
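
To make this concrete, here is a minimal sketch of the pair-level counting; the function name classify_pairs, its dict-based inputs, and the default pick_top are illustrative assumptions, not the exact code used in the paper. Precision, recall, and F1 scores then follow from these counts as defined in Section 5.3.

    from collections import Counter

    def classify_pairs(target_values_by_attr, predicted_values_by_attr,
                       pick_top=lambda values: values[0]):
        """Classify each attribute-product pair into one of the five cases.

        target_values_by_attr: {attribute: [target evidence values]}; an empty list
            marks a negative attribute-product pair (no evidence in the ground truth).
        predicted_values_by_attr: {attribute: [predicted values]} for the same product.
        pick_top: stand-in for "select the top prediction by some method".
        """
        counts = Counter()
        for attr, target_values in target_values_by_attr.items():
            predicted_values = predicted_values_by_attr.get(attr, [])
            if target_values:  # positive attribute-product pair
                if not predicted_values:
                    counts["VN"] += 1  # no value predicted
                elif pick_top(predicted_values) in target_values:
                    counts["VC"] += 1  # top prediction is in the target evidence set
                else:
                    counts["VW"] += 1  # top prediction is not in the target evidence set
            else:  # negative attribute-product pair
                if not predicted_values:
                    counts["NN"] += 1  # correctly predicts no value
                else:
                    counts["NV"] += 1  # incorrectly predicts some value
        return counts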

Question 1

{
    "target_tuples": [("Pattern", "Polka Dot"), ("Length", "Maxi"), ("Neckline", "Halter"), ("Length", "Extra Long")], 
    "predicted_tuples": [("Pattern", "Polka Dot"), ("Pattern", "Maxi"), ("Neckline", "Halter"), ("Type", "Dress")]
}

The counts should be

VN: 1
- target: ("Length", ["Maxi", "Extra Long"]), predicted: ("Length", "")
VC: 2
- target: ("Pattern", ["Polka Dot"]), predicted: ("Pattern", "Polka Dot") (if your method selects "Maxi" as the top prediction, then this goes to VW).
- target: ("Neckline", ["Halter"]), predicted: ("Neckline", "Halter")

The prediction of ("Type", "Dress") does not exist when attributes are given as inputs to the model, so this kind of generalization is not measured in the paper.
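
As a sanity check, the hypothetical classify_pairs sketch shown earlier in this thread reproduces these counts when "Polka Dot" is kept as the top Pattern prediction and the ("Type", "Dress") prediction is dropped:

    target = {"Pattern": ["Polka Dot"], "Length": ["Maxi", "Extra Long"], "Neckline": ["Halter"]}
    predicted = {"Pattern": ["Polka Dot", "Maxi"], "Neckline": ["Halter"]}  # ("Type", "Dress") dropped
    print(classify_pairs(target, predicted))  # Counter({'VC': 2, 'VN': 1})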

Question 2

Yes, it is expected that the attributes in the MAVE dataset are not consistent across products within the same category. In fact, each category has a predefined set of attributes, but we did not include attribute-product pairs whose evidences we do not have high confidence in.

Hope the above helps, and please feel free to let me know if there are more questions.