google-research-datasets / MAVE

The dataset contains 3 million attribute-value annotations across 1257 unique categories on 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for product attribute extraction study.

Understand evaluation in the paper better #6

Open junwang-wish opened 2 years ago

junwang-wish commented 2 years ago

Hi, thanks for the repo / data / paper, great work! I am creating this issue to understand how exactly evaluation is done. I am using an autoregressive formulation of attribute extraction: extraction is done through free-form text generation, and no attribute type is provided as input.

  1. For positive samples (product paragraphs that contain at least one attribute value), consider the following example ("target_attribute_vals" contains the annotated [attribute value](attribute type) spans that appear in "text", while "predicted_attribute_vals" contains the prediction from a model):

    {
    "text": "Nymph Womens's Chffion Polka Dot Maxi Halter Dress Extra Long",
    "target_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Length) [Halter](Neckline) Dress [Extra Long](Length)",
    "predicted_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Pattern) [Halter](Neckline) [Dress](Type) Extra Long"
    }

    The flattened tuples of target and prediction attribute values would be (a minimal parsing sketch for this bracketed format follows at the end of this comment):

    {
    "target_tuples": [("Pattern", "Polka Dot"), ("Length", "Maxi"), ("Neckline", "Halter"), ("Length", "Extra Long")], 
    "predicted_tuples": [("Pattern", "Polka Dot"), ("Pattern", "Maxi"), ("Neckline", "Halter"), ("Type", "Dress")]
    }
    • Then following Section 5.3 of the paper, No value (VN), Correct values (VC), Wrong values (VW) for the above would be
      No value (VN): 2 # ("Length", "Maxi") and ("Length", "Extra Long") missing
      Correct values (VC): 2 # ("Pattern", "Polka Dot") and ("Neckline", "Halter") correct
      Wrong values (VW): 1 # ("Pattern", "Maxi") is not matching ("Pattern", "Polka Dot")

      The above value counts sum up to 5 attribute value pairs, but the "target_tuples" only had 4 attribute value pairs. Is this expected?

  2. For negative samples (here I used the "target_attributes_as_in_file" format instead of the "target_attribute_vals" format from the earlier positive example, and I also added the category as shown in the file):

    [
    {
    "text": "Taylor Dresses Women's High Low Lace Shirt Dress", 
    "target_attributes_as_in_file": [{'key': 'Pattern', 'evidences': []}], 
    "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress", 
    "category": "Dresses"
    }, 
    {
    "text": "Taylor Dresses Women's High Low Lace Shirt Dress Nice", 
    "target_attributes_as_in_file": [{'key': 'Neckline', 'evidences': []}, {'key': 'Pattern', 'evidences': []}, , {'key': 'Type', 'evidences': []}], 
    "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress",
    "category": "Dresses"
    }
    ]

    Then following Section 5.3 of the paper,

    • No value (NN), some incorrect Value (NV) for the first sample would be
      No value (NN): 1 # No Pattern in "predicted_attribute_vals"
      some incorrect Value (NV): 0 # No Pattern in "predicted_attribute_vals", thus cannot be incorrect
    • No value (NN), some incorrect Value (NV) for the second sample (which is very similar to the first sample) would be
      No value (NN): 2 # No Neckline, Pattern in "predicted_attribute_vals"
      some incorrect Value (NV): 1 # [Dresses](Type) in "predicted_attribute_vals"

      It seems that some incorrect Value (NV) can change arbitrarily depending on "target_attributes_as_in_file", which is not consistent across the same category (every sample under category="Dresses" has a different "target_attributes_as_in_file"). Is this expected?
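For reference, this is roughly how I flatten the bracketed "[value](attribute)" output into (attribute, value) tuples; the regex and helper name below are just an illustrative sketch, not the exact code I use.

    import re

    # Matches "[value](attribute)" spans in the generated text.
    SPAN_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

    def flatten_annotations(annotated_text):
        """Flatten "[value](attribute)" spans into (attribute, value) tuples."""
        return [(attr, value) for value, attr in SPAN_RE.findall(annotated_text)]

    print(flatten_annotations(
        "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Length) "
        "[Halter](Neckline) Dress [Extra Long](Length)"))
    # [('Pattern', 'Polka Dot'), ('Length', 'Maxi'), ('Neckline', 'Halter'), ('Length', 'Extra Long')]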

liyang2019 commented 2 years ago

Hi,

In the paper, we compute the five cases (VN, VC, VW, NV, NN) at the attribute-product pair level, not at the attribute-product-evidence level. For an attribute-product pair, if a model provides multiple evidences, we select the top prediction by some method and check whether this top prediction belongs to the target evidence set. We count how many attribute-product pairs fall into each case, and compute precision, recall, and F1 scores based on these counts.
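For concreteness, here is a minimal sketch of this pair-level counting. It takes the first predicted value as the top prediction and uses exact string matching, both of which are simplifying assumptions, and it assumes the Section 5.3 definitions Precision = VC / (VC + VW + NV) and Recall = VC / (VC + VW + VN); it is not the exact evaluation code.

    from collections import Counter

    def count_cases(targets, predictions):
        """Count the five cases at the attribute-product pair level.

        targets:     dict mapping attribute -> list of ground-truth values
                     (an empty list marks a negative attribute-product pair).
        predictions: dict mapping attribute -> list of predicted values; the
                     first entry is treated as the top prediction.
        """
        counts = Counter()
        for attr, gold_values in targets.items():
            predicted = predictions.get(attr, [])
            top = predicted[0] if predicted else None
            if gold_values:                      # positive attribute-product pair
                if top is None:
                    counts["VN"] += 1            # no value predicted
                elif top in gold_values:
                    counts["VC"] += 1            # correct value predicted
                else:
                    counts["VW"] += 1            # wrong value predicted
            else:                                # negative attribute-product pair
                if top is None:
                    counts["NN"] += 1            # correctly predicted no value
                else:
                    counts["NV"] += 1            # some incorrect value predicted
        return counts

    def metrics(counts):
        """Precision / recall / F1 from the aggregated case counts."""
        precision = counts["VC"] / max(counts["VC"] + counts["VW"] + counts["NV"], 1)
        recall = counts["VC"] / max(counts["VC"] + counts["VW"] + counts["VN"], 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-12)
        return precision, recall, f1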

Question 1

{
    "target_tuples": [("Pattern", "Polka Dot"), ("Length", "Maxi"), ("Neckline", "Halter"), ("Length", "Extra Long")], 
    "predicted_tuples": [("Pattern", "Polka Dot"), ("Pattern", "Maxi"), ("Neckline", "Halter"), ("Type", "Dress")]
}

The counts should be

VN: 1
- target: ("Length", ["Maxi", "Extra Long"]), predicted: ("Length", "")
VC: 2
- target: ("Pattern", ["Polka Dot"]), predicted: ("Pattern", "Polka Dot") (if your method selects "Maxi" as the top prediction, then this pair goes to VW instead).
- target: ("Neckline", ["Halter"]), predicted: ("Neckline", "Halter")

The prediction of ("Type", "Dress") does not exist if attributes are as inputs to the model. So this kind of generalization hasn't been measured in the paper.

Question 2

Yes, it is expected that the attributes in the MAVE dataset are not consistent for the same category across different products. Each category has a predefined set of attributes, but we did not include attribute-product pairs whose evidences we were not confident about.
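A minimal sketch of how the per-product attribute lists in the file can be mapped into the targets dict used by count_cases above (assuming each evidence record carries a "value" field; an empty "evidences" list marks a negative attribute-product pair):

    def targets_from_file_attributes(file_attributes):
        """Turn MAVE-style {'key': ..., 'evidences': [...]} records into an
        attribute -> values mapping."""
        return {
            attr["key"]: [ev["value"] for ev in attr["evidences"]]
            for attr in file_attributes
        }

    # Second negative example from the question: all three listed
    # attribute-product pairs are negative, so a predicted value for any of
    # them counts toward NV and no predicted value counts toward NN.
    targets = targets_from_file_attributes([
        {"key": "Neckline", "evidences": []},
        {"key": "Pattern", "evidences": []},
        {"key": "Type", "evidences": []},
    ])
    predictions = {"Brand": ["Taylor"], "Type": ["Dresses", "Low Lace"], "Size": ["High"]}
    print(count_cases(targets, predictions))  # Counter({'NN': 2, 'NV': 1})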

Hope the above helps, and please feel free to let me know if there are more questions.