Closed praateekmahajan closed 5 years ago
A suggested (quick) solution is to add an if condition inside read_next that checks column.name in schema._field.keys().
You can have a look at my fork here.
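The guard suggested above can be sketched roughly as follows. This is only an illustration built on the standard library, not Petastorm's actual read_next code: keep_schema_columns, schema_fields, Row and the sample result_dict are hypothetical stand-ins for the schema and the worker result.

```python
from collections import namedtuple


def keep_schema_columns(result_dict, field_names):
    """Drop every key of result_dict that the schema does not declare."""
    allowed = set(field_names)
    return {name: value for name, value in result_dict.items() if name in allowed}


# Hypothetical stand-ins: a schema without columnA and a worker result
# that still carries columnA.
schema_fields = ["columnB", "columnC"]
Row = namedtuple("Row", schema_fields)
result_dict = {"columnA": [1, 2], "columnB": [3, 4], "columnC": [5, 6]}

# Filtering first means the namedtuple only receives fields it knows about.
row = Row(**keep_schema_columns(result_dict, schema_fields))
```

A filter like this hides the mismatch rather than explaining it, which is presumably why it is offered only as a quick fix.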
Seems to me that the problem here is a confusing interface. I see that you assumed that
transform = TransformSpec(lambda x : x, removed_fields=["columnA"])
removes "columnA" from the data-frame for you, while my intent was for removed_fields to indicate that your lambda actually removes the column. I can see how this is confusing. I would expect the code to be like this:
def delColumnA(x):
    del x['columnA']
    return x

transform = TransformSpec(delColumnA, removed_fields=["columnA"])
I'll prepare a diff to support syntax like this: TransformSpec(removed_fields=["columnA"]). Does it make sense to you?
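To make the three variants concrete, here is a small sketch of the semantics being discussed. It uses plain dicts in place of data-frames, and apply_spec is a hypothetical helper written for illustration; it is not how Petastorm implements TransformSpec, and the shorthand behaviour shown is only the one proposed in this comment.

```python
def apply_spec(row, func=None, removed_fields=()):
    """Hypothetical model of a transform spec applied to one row (a dict)."""
    row = dict(row)  # work on a copy so the caller's row is untouched
    if func is not None:
        # With a user function, removed_fields is only a declaration:
        # the function itself is expected to delete those columns.
        return func(row)
    # Proposed shorthand: with no function, drop the declared fields.
    for name in removed_fields:
        row.pop(name, None)
    return row


def del_column_a(x):
    del x["columnA"]
    return x


row = {"columnA": 1, "columnB": 2}

# Identity function: columnA is declared removed but still present.
identity_result = apply_spec(row, func=lambda x: x, removed_fields=["columnA"])
# Explicit deletion: data and declaration agree.
explicit_result = apply_spec(row, func=del_column_a, removed_fields=["columnA"])
# Proposed shorthand: no function needed.
shorthand_result = apply_spec(row, removed_fields=["columnA"])
```

Under these assumptions, only the identity variant leaves the data out of step with the declared schema, which matches the confusion described above.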
Sent you an invite to collaborate on Petastorm. Would appreciate your help reviewing #417 with the fix. Thanks!
Hey @selitvin, I'd love to review the PR, but it looks like I don't have the right perms. It also looks like a few CircleCI tests are failing!
Sadly, I won't be able to reproduce the error quickly.
I was trying to trace the error and realised that:
reader.schema doesn't have columnA.
arrow_reader_worker.py calls schema.make_namedtuple(**result_dict), where result_dict comes from the worker pool's workers_pool.get_results(). The result_dict seems to have columnA in it.
The error is as follows: