franztao closed this issue 2 years ago
Thanks for the suggestion. We're excited to see interest in easier re-use of the GriTS metrics. We are still actively developing this repository, and adding more documentation and support for using GriTS with other models is on our current roadmap. We plan to include a function to call GriTS with HTML, just like in the example you sent. It should be ready very soon.
Cheers, Brandon
Hi @franztao,
We pushed an update today with a new function grits_from_html(). We'll need to do more testing to make sure it is bug-free but it works on the case in the link you sent. You can use it as follows:
import grits
true_html = "..."
pred_html = "..."
metrics = grits.grits_from_html(true_html, pred_html)
print(metrics)
For the example you linked to, I get the following output:
{
'grits_top': 1.0,
'grits_precision_top': 1.0,
'grits_recall_top': 1.0,
'grits_top_upper_bound': 1.0,
'grits_con': 0.9670250896057349,
'grits_precision_con': 0.9670250896057349,
'grits_recall_con': 0.9670250896057349,
'grits_con_upper_bound': 0.9670250896057349
}
So basically GriTS_Top = 1.0 and GriTS_Con = 0.9670.
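For anyone who wants that one-line summary programmatically, here is a minimal sketch. It only reads the dictionary printed above; `summarize` is a hypothetical helper name and is not part of the `grits` module.

```python
# Metrics copied from the output shown above.
metrics = {
    'grits_top': 1.0,
    'grits_precision_top': 1.0,
    'grits_recall_top': 1.0,
    'grits_top_upper_bound': 1.0,
    'grits_con': 0.9670250896057349,
    'grits_precision_con': 0.9670250896057349,
    'grits_recall_con': 0.9670250896057349,
    'grits_con_upper_bound': 0.9670250896057349,
}


def summarize(metrics, digits=4):
    """Hypothetical helper: format the two headline scores to a fixed precision."""
    top = metrics['grits_top']
    con = metrics['grits_con']
    return f"GriTS_Top={top:.{digits}f}, GriTS_Con={con:.{digits}f}"


print(summarize(metrics))  # GriTS_Top=1.0000, GriTS_Con=0.9670
```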
Hope this helps!
Best, Brandon
The example does not include the GriTS_Loc (location) metric? Could you give a detailed example, with a picture, that describes how to use these metrics?
true_html=<html><body><table><thead><tr><eb></eb><td></td><td></td></tr></thead><tbody><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr></tbody></table></body></html>
pred_html=<html><body><table><thead><tr><eb></eb><td></td><td></td></tr></thead><tbody><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr><tr><td></td><eb></eb><eb></eb></tr><tr><td></td><td></td><td></td></tr><tr><td></td><td></td><td></td></tr></tbody></table></body></html>
Output bug as below:
@bsmock
Hi @franztao,
Thanks for bringing this to our attention and giving us a chance to discuss this case with you.
Running true_cells = grits.html_to_cells(true_html)
produces the following list of cells parsed from the HTML:
[
{'row_nums': [0], 'column_nums': [0], 'is_column_header': True, 'cell_text': ''},
{'row_nums': [0], 'column_nums': [1], 'is_column_header': True, 'cell_text': ''},
{'row_nums': [1], 'column_nums': [0], 'is_column_header': False, 'cell_text': ''},
{'row_nums': [2], 'column_nums': [0], 'is_column_header': False, 'cell_text': ''},
{'row_nums': [2], 'column_nums': [1], 'is_column_header': False, 'cell_text': ''},
{'row_nums': [2], 'column_nums': [2], 'is_column_header': False, 'cell_text': ''},
...
]
As you can see, the first three rows of the parsed table all have different numbers of columns (or, different numbers of columns that are occupied by a cell). I would say it's ambiguous how to interpret such incomplete HTML as a table. The metric is not designed to handle malformed HTML, so it fails.
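To make the row-by-row mismatch explicit, here is a small sketch that works on a list of cell dictionaries in the format shown above. The helper names `occupied_columns_per_row` and `is_rectangular` are made up for illustration and are not functions in the `grits` module.

```python
from collections import defaultdict


def occupied_columns_per_row(cells):
    """Map each row index to the sorted column indices occupied by some cell."""
    occupied = defaultdict(set)
    for cell in cells:
        for row in cell['row_nums']:
            occupied[row].update(cell['column_nums'])
    return {row: sorted(cols) for row, cols in sorted(occupied.items())}


def is_rectangular(cells):
    """Return True if every row occupies exactly columns 0..num_cols-1."""
    rows = occupied_columns_per_row(cells)
    all_cols = [c for cols in rows.values() for c in cols]
    if not all_cols:
        return True
    full_row = list(range(max(all_cols) + 1))
    return all(cols == full_row for cols in rows.values())


# The first six cells from the parsed list above.
cells = [
    {'row_nums': [0], 'column_nums': [0], 'is_column_header': True, 'cell_text': ''},
    {'row_nums': [0], 'column_nums': [1], 'is_column_header': True, 'cell_text': ''},
    {'row_nums': [1], 'column_nums': [0], 'is_column_header': False, 'cell_text': ''},
    {'row_nums': [2], 'column_nums': [0], 'is_column_header': False, 'cell_text': ''},
    {'row_nums': [2], 'column_nums': [1], 'is_column_header': False, 'cell_text': ''},
    {'row_nums': [2], 'column_nums': [2], 'is_column_header': False, 'cell_text': ''},
]

print(occupied_columns_per_row(cells))  # {0: [0, 1], 1: [0], 2: [0, 1, 2]}
print(is_rectangular(cells))            # False
```

A check like this could be run before computing the metric to flag tables whose parsed grid has gaps.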
For what the metric should do when encountering incomplete/ambiguous HTML, there are a few options we could consider:
Do you have a desired behavior for the metric in cases like this?
HTML can be malformed in other ways. In general, I'm not sure it's obvious what the "right" behavior is. If we anticipate certain kinds of malformed HTML, like in your example with missing cells, we could give the user the option to choose how they want the metric to handle it (possibly choosing among the five options above). But this also has some drawbacks.
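As an illustration of what one user-selectable policy for missing cells might look like, here is a sketch that pads every unoccupied grid position with an empty cell. This is an assumption made for the example, not necessarily one of the options under discussion, and `pad_missing_cells` is a hypothetical name rather than part of the `grits` module.

```python
def pad_missing_cells(cells):
    """Return a new cell list in which every unoccupied (row, column)
    position of the bounding grid is filled with an empty cell."""
    occupied = {
        (row, col)
        for cell in cells
        for row in cell['row_nums']
        for col in cell['column_nums']
    }
    if not occupied:
        return list(cells)

    num_rows = max(row for row, _ in occupied) + 1
    num_cols = max(col for _, col in occupied) + 1

    # Assumption: a padded cell is a header cell if its row already
    # contains at least one column-header cell.
    header_rows = {
        row
        for cell in cells
        if cell['is_column_header']
        for row in cell['row_nums']
    }

    padded = list(cells)
    for row in range(num_rows):
        for col in range(num_cols):
            if (row, col) not in occupied:
                padded.append({
                    'row_nums': [row],
                    'column_nums': [col],
                    'is_column_header': row in header_rows,
                    'cell_text': '',
                })
    # Keep the list in reading order (top-to-bottom, left-to-right).
    padded.sort(key=lambda c: (min(c['row_nums']), min(c['column_nums'])))
    return padded


sparse = [
    {'row_nums': [0], 'column_nums': [0], 'is_column_header': True, 'cell_text': ''},
    {'row_nums': [0], 'column_nums': [1], 'is_column_header': True, 'cell_text': ''},
    {'row_nums': [1], 'column_nums': [0], 'is_column_header': False, 'cell_text': ''},
    {'row_nums': [2], 'column_nums': [2], 'is_column_header': False, 'cell_text': ''},
]
print(len(pad_missing_cells(sparse)))  # 9
```

Padding changes what is being scored, so both the ground truth and the prediction would need to go through the same policy for the comparison to stay fair.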
Best, Brandon
Hello @bsmock,
Thanks for your reply. I personally define the tag <td></td> to represent a cell with text content (I convert all text content to the string '1'), and <eb></eb> to represent a cell without any content.
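A minimal sketch of that conversion, assuming the structure strings only contain bare <td></td> and <eb></eb> tokens with no attributes such as colspan or rowspan. The function name `to_standard_html` is hypothetical and not part of the `grits` module.

```python
def to_standard_html(structure_html, placeholder="1"):
    """Rewrite the custom token convention as plain HTML table cells.

    <td></td>  (cell with content) -> <td>1</td>
    <eb></eb>  (empty cell)        -> <td></td>
    """
    # Fill the content cells first, so the empty cells created in the
    # second step are not mistaken for content cells.
    html = structure_html.replace("<td></td>", f"<td>{placeholder}</td>")
    html = html.replace("<eb></eb>", "<td></td>")
    return html


header_row = "<tr><eb></eb><td></td><td></td></tr>"
print(to_standard_html(header_row))
# <tr><td></td><td>1</td><td>1</td></tr>
```

The order of the two replacements matters: swapping them would turn every cell, empty or not, into a cell containing the placeholder.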
<html><body><table><thead><tr><td></td><td>1</td><td>1</td></tr></thead><tbody><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr></tbody></table></body></html>
<html><body><table><thead><tr><td></td><td>1</td><td>1</td></tr></thead><tbody><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td></td><td></td></tr><tr><td>1</td><td>1</td><td>1</td></tr><tr><td>1</td><td>1</td><td>1</td></tr></tbody></table></body></html>
{'grits_top': 1.0, 'grits_precision_top': 1.0, 'grits_recall_top': 1.0, 'grits_top_upper_bound': 1.0, 'grits_con': 1.0, 'grits_precision_con': 1.0, 'grits_recall_con': 1.0, 'grits_con_upper_bound': 1.0}
Picture of the parsed output for the HTML bad case, rendered on the website https://verytoolz.com/html-run.html
The code that calculates the GriTS evaluation metric is embedded in the validation code process; it is not KISS, and it is not convenient to re-use the metric.