turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0

Enhance `ignore_colspan` behaviour on HTML plugin #298

Open turicas opened 5 years ago

turicas commented 5 years ago

If ignore_colspan=True (the default), every row with fewer cells than the longest row in that table is ignored. This was done so that all rows have the same number of fields, but it can lead to data loss. Ideally, the plugin would interpret the colspan information and fill the missing cells with blanks.

The test HTML can be this one:

<table>
  <thead>
    <tr>
      <th> f1 </th>
      <th> f2 </th>
      <th> f3 </th>
    </tr>
  </thead>

  <tbody>
    <tr>
      <td> row0 f1 </td>
      <td> row0 f2 </td>
      <td> row0 f3 </td>
    </tr>
    <tr>
      <td> row1 f1 </td>
      <td colspan="2"> row1 f2-3 </td>
    </tr>
  </tbody>

</table>

And the code:

import rows
for row in rows.import_from_html("test.html"):
    print(row)

The current implementation prints:

Row(f1='row0 f1', f2='row0 f2', f3='row0 f3')
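
The missing second row is consistent with the filtering described above. A simplified model of that behaviour on plain lists follows; it is an assumption about how the filter works rather than the plugin's actual code, and `drop_short_rows` is a hypothetical name:

```python
def drop_short_rows(table_rows):
    """Keep only the rows that are as long as the longest row."""
    max_columns = max(map(len, table_rows))
    return [row for row in table_rows if len(row) == max_columns]


table_rows = [
    ["f1", "f2", "f3"],
    ["row0 f1", "row0 f2", "row0 f3"],
    ["row1 f1", "row1 f2-3"],  # the colspan="2" cell counts as a single cell
]

# Only the header and row0 survive; row1 is silently discarded.
for row in drop_short_rows(table_rows):
    print(row)
```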

The ideal implementation would print:

Row(f1='row0 f1', f2='row0 f2', f3='row0 f3')
Row(f1='row1 f1', f2='row1 f2-3', f3=None)
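
One way to get there is to expand each colspan cell into its value followed by blanks while the rows are collected. The sketch below uses only the standard library's `html.parser` so that it runs on its own; the plugin's real parsing code is not shown in this thread, and `ColspanTableParser` is a hypothetical helper, not part of rows. It does not handle nested tables or rowspan.

```python
from html.parser import HTMLParser


class ColspanTableParser(HTMLParser):
    """Collect table rows, padding each colspan cell with blanks (None)."""

    def __init__(self):
        super().__init__()
        self.table_rows = []  # finished rows
        self._row = None      # row being built, or None outside <tr>
        self._cell = None     # text chunks of the current cell
        self._span = 1        # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            colspan = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, zero or non-numeric values.
            self._span = max(int(colspan), 1) if colspan.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            # Fill the extra columns covered by colspan with blanks.
            self._row.extend([None] * (self._span - 1))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.table_rows.append(self._row)
            self._row = None


html = """<table>
  <tr><th> f1 </th><th> f2 </th><th> f3 </th></tr>
  <tr><td> row0 f1 </td><td> row0 f2 </td><td> row0 f3 </td></tr>
  <tr><td> row1 f1 </td><td colspan="2"> row1 f2-3 </td></tr>
</table>"""

parser = ColspanTableParser()
parser.feed(html)
parser.close()
for row in parser.table_rows:
    print(row)
```

With this approach every row has the same number of fields, so no row needs to be dropped.
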

yumpyy commented 4 weeks ago

Hello! I came up with a hacky yet simple way to fix this.

    if ignore_colspan:
        # max_columns is computed but not used below
        max_columns = max(map(len, table_rows))
        # Keep every row, including the ones shortened by colspan
        table_rows_temp = []
        for row in table_rows:
            table_rows_temp.append(row)
        table_rows = table_rows_temp

    meta = {"imported_from": "html", "source": source}
    return create_table(table_rows, meta=meta, *args, **kwargs)

Output:

Rows:
['f1', 'f2', 'f3']
['row0 f1', 'row0 f2', 'row0 f3']
['row1 f1', 'row1 f2-3']
Row(f1='row0 f1', f2='row0 f2', f3='row0 f3')
Row(f1='row1 f1', f2='row1 f2-3', f3=None)

I append every row to a separate list (table_rows_temp) and then reassign it to table_rows. Apparently, due to some post-processing done by rows, the missing element is automatically set to None. I'm not sure which code block is responsible for assigning None to the missing element, but it does work.

I have tested it with other possible cases, and it works there as well.
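
Since it is unclear which step assigns None, an alternative would be to pad the short rows explicitly before the table is built, instead of relying on that post-processing. A minimal sketch, where `pad_rows` is a hypothetical helper and not part of rows:

```python
def pad_rows(table_rows, fill=None):
    """Return copies of the rows, each padded with `fill` to the longest row's length."""
    max_columns = max(map(len, table_rows), default=0)
    return [list(row) + [fill] * (max_columns - len(row)) for row in table_rows]


table_rows = [
    ["f1", "f2", "f3"],
    ["row0 f1", "row0 f2", "row0 f3"],
    ["row1 f1", "row1 f2-3"],
]

for row in pad_rows(table_rows):
    print(row)
```

Note that this pads at the end of the row, so it only lines up with the columns when the colspan cell is the last one in its row.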