Open junwang-wish opened 2 years ago
Hi,
In the paper, we compute the five cases (VN, VC, VW, NV, NN) at the attribute-product pair level, not at the attribute-product-evidence level. For an attribute-product pair, if a model provides multiple evidences, we select the top prediction by some method and check whether this top prediction belongs to the target evidence set. We then count how many attribute-product pairs fall into each case and compute precision, recall, and F1 scores from those counts.
```
{
  "target_tuples": [("Pattern", "Polka Dot"), ("Length", "Maxi"), ("Neckline", "Halter"), ("Length", "Extra Long")],
  "predicted_tuples": [("Pattern", "Polka Dot"), ("Pattern", "Maxi"), ("Neckline", "Halter"), ("Type", "Dress")]
}
```
The counts should be
VN: 1
- target: ("Length", ["Maxi", "Extra Long"]), predicted: ("Length", "")
VC: 2
- target: ("Pattern", ["Polka Dot"]), predicted: ("Pattern", "Polka Dot") (if your method selects "Maxi" as the top prediction, then this pair goes to VW instead).
- target: ("Neckline", ["Halter"]), predicted: ("Neckline", "Halter")
The prediction of ("Type", "Dress") does not exist if attributes are as inputs to the model. So this kind of generalization hasn't been measured in the paper.
Yes, it is expected that attributes in the MAVE dataset are not consistent for the same category across different products.
In fact, each category has a predefined set of attributes, but we didn't include attribute-product pairs whose evidences we were not sufficiently confident in.
Hope the above helps, and please feel free to let me know if there are more questions.
Hi, thanks for the repo / data / paper, great work! I am creating this issue to understand how exactly evaluation is done, since I am using an autoregressive formulation of attribute extraction: extraction is done through free-form text generation, and no attribute type is provided as input.
For positive samples (product paragraphs that contain at least one attribute value), for the following example ("target_attribute_vals" contains the annotated [attribute value](attribute type) pairs that appear in "text", while "predicted_attribute_vals" contains the predictions from a model), the flattened tuples of target and predicted attribute values would be:
The above value counts sum up to 5 attribute-value pairs, but "target_tuples" only had 4 attribute-value pairs. Is this expected?

For negative samples (here I used the "target_attributes_as_in_file" format instead of the "target_attribute_vals" format from the earlier positive examples, and I also added "category" as shown in the file), following Section 5.3 of the paper,
it seems that the NV (some incorrect Value) counts can change arbitrarily depending on "target_attributes_as_in_file", which is not consistent across the same "category" (every sample under category="Dresses" has a different "target_attributes_as_in_file"). Is this expected?
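To make this concrete, here is a small made-up illustration (the attribute lists and the prediction below are invented, and I am assuming only the attributes listed for a product are evaluated for that product) of how the negative pairs, and therefore the NV count, depend on each sample's own "target_attributes_as_in_file":

```python
# Made-up illustration: for a negative sample, the (attribute, product) pairs that
# can become NV or NN come from that sample's own "target_attributes_as_in_file",
# assuming only the listed attributes are evaluated for that product.
model_prediction = {"Pattern": "Floral"}  # free-form generation, no attribute given as input

for attrs_in_file in (["Pattern", "Length"], ["Neckline"]):
    nv = sum(1 for a in attrs_in_file if a in model_prediction)      # predicted, but no target value
    nn = sum(1 for a in attrs_in_file if a not in model_prediction)  # no prediction, no target value
    print(attrs_in_file, "-> NV:", nv, "NN:", nn)
# ['Pattern', 'Length'] -> NV: 1 NN: 1
# ['Neckline'] -> NV: 0 NN: 1
```

So two products in the same category can yield different NV/NN counts for the same generated output, purely because their attribute lists differ.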