turicas / rows

A common, beautiful interface to tabular data, no matter the format
GNU Lesser General Public License v3.0
865 stars 136 forks source link

Feature request: add option to disable field type inferencing #351

Open aborruso opened 4 years ago

aborruso commented 4 years ago

Hi, I have an input HTML table in which a field has values as 06583. If I convert this table to CSV, this value becomes 6583.0.

It would be great to have a cli option to completely disable field type inferencing, and to have all output values as text field.

Thank you

turicas commented 4 years ago

By now, you can force the type for a given column to be rows.fields.TextField. Add force_types parameter to the import function: rows.import_from_html(..., force_types={"field_name": rows.fields.TextField}).

aborruso commented 4 years ago

Turicas thank you.

My request is related to command line use. Is it possible to do it via cli?

Best regards

turicas commented 4 years ago

Oh, sorry. Let's reopen it.

turicas commented 4 years ago

Regarding to CLI: unfortunately not all CLI commands support forcing types. Some of them (csv2sqlite, pgimport) support passing a schema file via --schema, but this way you need specify all field types. But there's a hack you could do: convert CSV to SQLite forcing text (or integer or whatever type you want) and then back to CSV. Like this:

Let's say you have a file test.csv with the following contents:

a,b
06583,some text
123,more text

Create the file schema.csv with:

field_name,field_type
a,text
b,text

Then, execute the commands:

rows csv2sqlite --schemas=schema.csv test.csv test-temp.sqlite
rows sqlite2csv test-temp.sqlite test test2.csv
rm test-temp.sqlite

and use test2.csv.

turicas commented 4 years ago

@aborruso if you think the field is being detected incorrectly (it was detected as float while you think it should be detected as int, for example), please open a new issue with data I can reproduce.

aborruso commented 4 years ago

hi @turicas my input is this example HTML

<!DOCTYPE html>
<html>
<body>
<table id="results" border="0" class="regpub_dati c35">
        <tbody>
            <tr class="c28">
                <th class="c27">Beneficiario</th>
                <th class="c27">Comune</th>
                <th class="c27">CAP</th>
                <th class="c27">Provincia </th>
                <th class="c27">Importo</th>
            </tr>

            <tr>
                <td class="c31">RNDFNC60E16</td>
                <td class="c31">RIPACANDIDA</td>
                <td class="c31">85020</td>
                <td class="c31">POTENZA</td>
                <td class="c34">09269</td>
            </tr>

            <tr>
                <td class="c31">RNDFNC60E16</td>
                <td class="c31"></td>
                <td class="c31"></td>
                <td class="c31">POTENZA</td>
                <td class="c34">05269</td>
            </tr>
        </tbody>
        </table>
        </body>
</html>

If I run rows convert input.html output.csv I have

+--------------+-------------+-------+-----------+---------+
| beneficiario | comune      | cap   | provincia | importo |
+--------------+-------------+-------+-----------+---------+
| RNDFNC60E16  | RIPACANDIDA | 85020 | POTENZA   | 9269.0  |
| RNDFNC60E16  | -           | -     | POTENZA   | 5269.0  |
+--------------+-------------+-------+-----------+---------+

The last field is a string and it becomes a float.

I'm not able to create a PR and help you. But rows is really a great tool to convert almost everything to a usable table; my suggestion, my feature request is to add to the cli, an option like -I that disable every kind of field inferencing, that force all field to be text fields. Then it will be the user, after the first import, to apply the right casting.

Thank you