petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License
1.24k stars 193 forks source link

Fix fromjson() to support reading from stdin #667

Open yaniv-aknin opened 5 months ago

yaniv-aknin commented 5 months ago

Trivial PR to support reading from stdin (source=None) in fromjson().

Two things you may want me to do before merging -

  1. I didn't add tests because there are no tests I could find for the use of source=None in any io module (e.g., also not in csv etc). Generally, perhaps I missed them but I didn't see tests that fork/exec the bin/petl executable at all...?
  2. Add source=None to other modules. I ran git grep 'def from' | grep source | grep -v source=None and found a few more that miss it, let me know if you'd like me to add it to any/all of them. Modules which appear to be missing source=None: avro, bcolz, pytables (maybe), xml

Checklist

Use this checklist to ensure the quality of pull requests that include new code and/or make changes to existing code.

coveralls commented 5 months ago

Pull Request Test Coverage Report for Build 9050397434

Details


Totals Coverage Status
Change from base Build 8715866607: 0.03%
Covered Lines: 13366
Relevant Lines: 14671

💛 - Coveralls
juarezr commented 5 months ago

Adding to the investigation:

The fromjson() parameters args and kwargs are forwarded to the json library call:

def fromjson(source, *args, **kwargs):
    """..."""
   ...

...
dicts = json.load(f, *self.args, **self.kwargs)
...
juarezr commented 5 months ago

But the json library calls used take only one positional argument followed by others keyword arguments:

 def json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    ...

def  json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
   ...

It looks like the args parameter is simply ignored by json.load and json.loads calls.

Now I'm wondering if the function syntax shouldn't be:


def fromjson(source=None, *, **kwargs):
    """..."""
   ...
juarezr commented 5 months ago

@yaniv-aknin,

After looking at the source code, I've concluded that your proposal looks good.

Also, some considerations:

So, what are your plans for this PR?

yaniv-aknin commented 5 months ago

Thanks @juarezr , this sounds good.

  1. I'd be happy to add source=None to more cases, where it'd work easily. Where it'd take a more substantial refactoring (or even just infeasible), I'd leave it unchanged.

  2. I'm mildly indifferent about changing the signature, but don't mind doing it of course. If (sources=None, *, **kwargs) looks good to you, great, I'll do that.

  3. I'm also happy to add some tests that fork/exec petl, and in particular will test the "pipe from stdin" case. I propose to -

    • Create a new test module petl/test/test_script.py
    • Have it find bin/petl in the package directory and fork/exec it using the subprocess module, testing results
    • Add a few simple test cases around that
    • Make the above pass in petl's supported environments (2.7, 3.6-3.12)

Does that sound reasonable and I can update the PR?

juarezr commented 4 months ago

After checking it, I've found that the change in signature won't work:

>>> def fromjson(source=None, *, **kwargs):
...    print("source:", source, "kwargs:", kwargs) 
... 
  File "<stdin>", line 1
SyntaxError: named arguments must follow bare *

So we should keep the args argument as it is currently, and proceed.

Eventually, we could remove the self.args property and the forwarding to json.load functions as it is pointless in:

dicts = json.load(f, *self.args, **self.kwargs)

But it shouldn't be mandatory.

yaniv-aknin commented 4 months ago

I've made some progress, but not all checks pass yet.

I'll try to look into this again next week, but admit I'm beginning to question how much time I can put into this; small fix, lots of overhead :sweat_smile: