petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License

Bug: tojson() tries to read stdin twice, returns empty results #668

Open yaniv-aknin opened 2 months ago

yaniv-aknin commented 2 months ago

What happened?

When using the petl executable, tojson() from stdin returns empty results -

$ petl 'dummytable().head(3).tocsv()' | petl 'fromcsv().tojson()'
[]
$

This is also true for a trivial program, without the executable (see code for repro.py below) -

$ ./repro.py < repro.csv
[]
$

What is the expected behavior?

I'd expect some data.

For example, tocsv() doesn't exhibit this problem -

$ petl 'dummytable().head(3).tocsv()' | petl 'fromcsv().tocsv()'          
foo,bar,baz
82,bananas,0.7873787427711181
3,oranges,0.13771232086689877
13,pears,0.24287642641761387
$

Reproducible test case

This is repro.py. Passing CSV data on stdin will emit an empty JSON array.

#!/usr/bin/env python3

import petl

petl.fromcsv().tojson()

What version of petl did you find the bug in?

v.1.7.15

Version

python 3.12

What OS are you seeing the problem on?

MacOS

What OS version are you using?

No response

What package manager did you use to install?

Other

What's the current installed packages?

No response

Relevant log output

No response

Additional Notes

I wasn't sure how to fix it, but I'm pretty sure the bug is that sys.stdin is read twice (this line in tojson() invokes CSVView.__iter__ twice).

The first read depletes the lines from stdin and incorrectly discards the results. I'll try to investigate this further and report here if I do, but I also wanted other folks to be aware of the bug.
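The failure mode described above can be reproduced without petl. The following is a minimal sketch, assuming a table view that re-reads its underlying stream on every iteration; `StreamView` is a hypothetical stand-in for illustration, not petl's actual `CSVView`, and `io.StringIO` stands in for `sys.stdin`.

```python
import csv
import io


class StreamView:
    """Hypothetical stand-in for a table view backed by a one-shot stream."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        # Every iteration reads from the same stream object; nothing is rewound.
        return iter(csv.reader(self.stream))


# io.StringIO plays the role of sys.stdin here.
stream = io.StringIO("foo,bar\n1,apples\n2,pears\n")
view = StreamView(stream)

first_pass = list(view)   # consumes the whole stream
second_pass = list(view)  # the stream is already at EOF

print(first_pass)
print(second_pass)
```

If a writer iterates such a view twice and only uses the second pass, it sees no rows, which would match the empty `[]` output reported here.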

juarezr commented 2 months ago

It looks like something went wrong with tocsv(), because this pattern works in other similar functions:

❯ petl 'dummytable().head(3).tojson()' | petl 'fromjson(source=None)'
+-----+-----------+---------------------+
| foo | bar       | baz                 |
+=====+===========+=====================+
|  65 | 'pears'   |  0.9035174673053437 |
+-----+-----------+---------------------+
|  28 | 'bananas' | 0.38930455757975013 |
+-----+-----------+---------------------+
|  67 | 'oranges' |  0.4340697139584314 |
+-----+-----------+---------------------+

❯ petl 'dummytable().head(3).tojson()' | petl 'fromjson(source=None).tocsv()'
foo,bar,baz
5,bananas,0.6813039656995481
34,pears,0.36447357286299165
47,bananas,0.39808978898927116

yaniv-aknin commented 2 months ago

I don't think so -

$ petl 'dummytable().topickle("test.pkl")'
$ petl 'frompickle().tojson()' < test.pkl 
[]
$

I think the problem is "tojson() reading from a table initialized from stdin": tojson() iterates the underlying table twice, and the underlying table doesn't retain what it read from stdin the first time.

(None of your examples make tojson() read from a table initialized from stdin, such as one created by fromcsv() or frompickle().)
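One possible direction for a fix is to have a stdin-backed table keep what it read, so that later iterations replay from memory. This is only a sketch under that assumption: `BufferedView` is a hypothetical name, not part of petl, and `io.StringIO` stands in for `sys.stdin`.

```python
import csv
import io


class BufferedView:
    """Hypothetical view that reads its stream once and replays from memory."""

    def __init__(self, stream):
        self._stream = stream
        self._rows = None  # filled on first iteration

    def __iter__(self):
        if self._rows is None:
            # Read the one-shot stream exactly once and keep the rows.
            self._rows = list(csv.reader(self._stream))
        return iter(self._rows)


# io.StringIO plays the role of sys.stdin here.
buffered = BufferedView(io.StringIO("foo,bar\n1,apples\n2,pears\n"))

pass_one = list(buffered)
pass_two = list(buffered)  # served from the in-memory copy

print(pass_one == pass_two)
```

The trade-off is that the whole input is held in memory, which may not suit large streams; fixing the writer so it iterates only once would avoid that cost.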

juarezr commented 2 months ago

Certainly, we need to:

Looking further, I quickly found what looks like another inconsistency:

❯ petl 'dummytable().head(3).tojson()' | petl 'fromjson(source=None).tohtml()'
<table class='petl'>
<thead>
<tr>
<th>foo</th>
<th>bar</th>
<th>baz</th>
</tr>
</thead>
<tbody>
<tr>
<td style='text-align: right'>23</td>
<td>oranges</td>
<td style='text-align: right'>0.5601641490162261</td>
</tr>
<tr>
<td style='text-align: right'>12</td>
<td>oranges</td>
<td style='text-align: right'>0.6160886095315175</td>
</tr>
<tr>
<td style='text-align: right'>23</td>
<td>pears</td>
<td style='text-align: right'>0.8090897047903948</td>
</tr>
</tbody>
</table>
❯ petl 'dummytable().head(3).tojson()' | petl 'fromjson(source=None).toxml()'
Traceback (most recent call last):
  File "/home/juarezr/.virtualenvs/py_petl_v175/bin/petl", line 25, in <module>
    r = eval(expression)
        ^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
AttributeError: 'JsonView' object has no attribute 'toxml'. Did you mean: 'tohtml'?