karenjexphd / classification_extraction_tests

Docker images and runtime commands for end-to-end cell classification and table extraction tests

Labels that appear to be correct are not flagged as true positives #2

Closed karenjexphd closed 11 months ago

karenjexphd commented 12 months ago
karenjexphd commented 11 months ago

This appears to be resolved for the tables in the 2nd example but not for those in the 1st example. For C10020 we have a single true positive, but the contents of gt_label_set and output_label_set below suggest we should expect 4 matches:

table_model=# select * from gt_label_set where table_name='C10020_0_0' order by top_row, left_col;
 table_name | cell_id | left_col | top_row | category_name |             label
------------+---------+----------+---------+---------------+--------------------------------
 C10020_0_0 |    2117 |        2 |       1 | ColumnHeading | european parliament elec. 2009
 C10020_0_0 |    2116 |        3 |       1 | ColumnHeading | european parliament elec. 2004
 C10020_0_0 |    2115 |        4 |       1 | ColumnHeading | european parliament elec. 1999
 C10020_0_0 |    2114 |        5 |       1 | ColumnHeading | european parliament elec. 1996
 C10020_0_0 |    2125 |        1 |       4 | RowHeading1   | national coalition pty
 C10020_0_0 |    2124 |        1 |       5 | RowHeading1   | centre pty of finland
 C10020_0_0 |    2123 |        1 |       6 | RowHeading1   | social democr. pty
 C10020_0_0 |    2122 |        1 |       7 | RowHeading1   | greens
 C10020_0_0 |    2121 |        1 |       8 | RowHeading1   | true finns
 C10020_0_0 |    2120 |        1 |       9 | RowHeading1   | swedish people's pty
 C10020_0_0 |    2119 |        1 |      10 | RowHeading1   | left
 C10020_0_0 |    2118 |        1 |      11 | RowHeading1   | christian democrats
(12 rows)

table_model=# select * from output_label_set where table_method='hypoparsr' and table_name='C10020_0_0' order by top_row, left_col;
 table_name | table_method | cell_id | left_col | top_row | category_name |              label
------------+--------------+---------+----------+---------+---------------+----------------------------------
 C10020_0_0 | hypoparsr    |   47382 |        1 |       1 | ColumnHeading | Party
 C10020_0_0 | hypoparsr    |   47383 |        2 |       1 | ColumnHeading | European Parliament elec. 2009
 C10020_0_0 | hypoparsr    |   47384 |        3 |       1 | ColumnHeading | European Parliament elec. 2004
 C10020_0_0 | hypoparsr    |   47385 |        4 |       1 | ColumnHeading | European Parliament elec. 1999
 C10020_0_0 | hypoparsr    |   47386 |        5 |       1 | ColumnHeading | European Parliament elec. 1996
(5 rows)

table_model=# select * from label_true_positives where table_method='hypoparsr' and table_name='C10020_0_0';
 table_name | table_method | label_true_pos
------------+--------------+----------------
 C10020_0_0 | hypoparsr    |              1
(1 row)

Closer inspection shows there are trailing spaces in the label values identified by hypoparsr:

select 'xx'||label||'xx' from output_label_set where table_method='hypoparsr' and table_name='C10020_0_0' order by top_row, left_col;
               ?column?
--------------------------------------
 xxPartyxx
 xxEuropean Parliament elec. 2009  xx
 xxEuropean Parliament elec. 2004  xx
 xxEuropean Parliament elec. 1999 xx
 xxEuropean Parliament elec. 1996 xx

The is_reconcilable() function needs to be updated to ignore these trailing spaces.
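The implementation of is_reconcilable() is not shown in this issue, so the following is only a minimal Python sketch of the kind of normalization the fix would need. The helper names normalize_label and labels_match are hypothetical. The sketch assumes labels should be compared case-insensitively (the ground-truth labels above are lowercase while the hypoparsr labels are mixed case), and it ignores cell positions, which the real comparison may also take into account.

```python
def normalize_label(label: str) -> str:
    """Strip leading/trailing whitespace and lowercase a label.

    Hypothetical helper: the lowercasing is an assumption based on the
    ground-truth labels in gt_label_set being stored in lowercase.
    """
    return label.strip().lower()


def labels_match(gt_label: str, output_label: str) -> bool:
    """Return True when two labels are equal after normalization."""
    return normalize_label(gt_label) == normalize_label(output_label)


# Label values copied from the query results in this issue.
gt_labels = [
    "european parliament elec. 2009",
    "european parliament elec. 2004",
    "european parliament elec. 1999",
    "european parliament elec. 1996",
]

# hypoparsr labels, including the trailing spaces revealed by the 'xx' query.
output_labels = [
    "Party",
    "European Parliament elec. 2009  ",
    "European Parliament elec. 2004  ",
    "European Parliament elec. 1999 ",
    "European Parliament elec. 1996 ",
]

# Count output labels that match any ground-truth label once normalized.
normalized_gt = {normalize_label(g) for g in gt_labels}
match_count = sum(normalize_label(o) in normalized_gt for o in output_labels)
print(match_count)  # prints 4
```

If is_reconcilable() is written in SQL or PL/pgSQL, the equivalent normalization in PostgreSQL would be btrim() (or rtrim() for trailing spaces only) combined with lower().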