uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

ast Syntax error when parsing non-petastorm dataset #507

Open working-estimate opened 4 years ago

working-estimate commented 4 years ago

Using TensorFlow 1.14.0 and the latest petastorm. My Parquet schema is:

 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
...
 |-- col99: integer (nullable = true)
 |-- target: array (nullable = true)
 |    |-- element: integer (containsNull = true)

The target column is a fixed-length array column. When I try to generate a TensorFlow dataset from this, I get the following error:


2020-03-17 12:40:17,160 [INFO] Permanently whitelisted: <class 'petastorm.unischema.inferred_schema_view_view'>: constructor
2020-03-17 12:40:17,322 [INFO] Converted call: <function make_petastorm_dataset.<locals>.<lambda> at 0x14663b7a0>
    args: (inferred_schema_view_view(col1=<tf.Tensor 'args_0:0' shape=<unknown> dtype=int32>, col10=<tf.Tensor 'args_1:0' shape=<unknown> dtype=int32>, col11=<tf.Tensor 'args_2:0' shape=<unknown> dtype=int32>, col12=<tf.Tensor 'args_3:0' shape=<unknown> dtype=int32>, col13=<tf.Tensor 'args_4:0' shape=<unknown> dtype=int32>, col14=<tf.Tensor 'args_5:0' shape=<unknown> dtype=int32>, col15=<tf.Tensor 'args_6:0' shape=<unknown> dtype=int32>, col16=<tf.Tensor 'args_7:0' shape=<unknown> dtype=int32>, col17=<tf.Tensor 'args_8:0' shape=<unknown> dtype=int32>, col18=<tf.Tensor 'args_9:0' shape=<unknown> dtype=int32>, col19=<tf.Tensor 'args_10:0' shape=<unknown> dtype=int32>, col2=<tf.Tensor 'args_11:0' shape=<unknown> dtype=int32>, col20=<tf.Tensor 'args_12:0' shape=<unknown> dtype=int32>, col21=<tf.Tensor 'args_13:0' shape=<unknown> dtype=int32>, col22=<tf.Tensor 'args_14:0' shape=<unknown> dtype=int32>, col23=<tf.Tensor 'args_15:0' shape=<unknown> dtype=int32>, col24=<tf.Tensor 'args_16:0' shape=<unknown> dtype=int32>, col25=<tf.Tensor 'args_17:0' shape=<unknown> dtype=int32>, col26=<tf.Tensor 'args_18:0' shape=<unknown> dtype=int32>, col27=<tf.Tensor 'args_19:0' shape=<unknown> dtype=int32>, col28=<tf.Tensor 'args_20:0' shape=<unknown> dtype=int32>, col29=<tf.Tensor 'args_21:0' shape=<unknown> dtype=int32>, col3=<tf.Tensor 'args_22:0' shape=<unknown> dtype=int32>, col30=<tf.Tensor 'args_23:0' shape=<unknown> dtype=int32>, col31=<tf.Tensor 'args_24:0' shape=<unknown> dtype=int32>, col32=<tf.Tensor 'args_25:0' shape=<unknown> dtype=int32>, col33=<tf.Tensor 'args_26:0' shape=<unknown> dtype=int32>, col34=<tf.Tensor 'args_27:0' shape=<unknown> dtype=int32>, col35=<tf.Tensor 'args_28:0' shape=<unknown> dtype=int32>, col36=<tf.Tensor 'args_29:0' shape=<unknown> dtype=int32>, col37=<tf.Tensor 'args_30:0' shape=<unknown> dtype=int32>, col38=<tf.Tensor 'args_31:0' shape=<unknown> dtype=int32>, col39=<tf.Tensor 'args_32:0' shape=<unknown> dtype=int32>, col4=<tf.Tensor 'args_33:0' 
shape=<unknown> dtype=int32>, col40=<tf.Tensor 'args_34:0' shape=<unknown> dtype=int32>, col41=<tf.Tensor 'args_35:0' shape=<unknown> dtype=int32>, col42=<tf.Tensor 'args_36:0' shape=<unknown> dtype=int32>, col43=<tf.Tensor 'args_37:0' shape=<unknown> dtype=int32>, col44=<tf.Tensor 'args_38:0' shape=<unknown> dtype=int32>, col45=<tf.Tensor 'args_39:0' shape=<unknown> dtype=int32>, col46=<tf.Tensor 'args_40:0' shape=<unknown> dtype=int32>, col47=<tf.Tensor 'args_41:0' shape=<unknown> dtype=int32>, col48=<tf.Tensor 'args_42:0' shape=<unknown> dtype=int32>, col49=<tf.Tensor 'args_43:0' shape=<unknown> dtype=int32>, col5=<tf.Tensor 'args_44:0' shape=<unknown> dtype=int32>, col50=<tf.Tensor 'args_45:0' shape=<unknown> dtype=int32>, col51=<tf.Tensor 'args_46:0' shape=<unknown> dtype=int32>, col52=<tf.Tensor 'args_47:0' shape=<unknown> dtype=int32>, col53=<tf.Tensor 'args_48:0' shape=<unknown> dtype=int32>, col54=<tf.Tensor 'args_49:0' shape=<unknown> dtype=int32>, col55=<tf.Tensor 'args_50:0' shape=<unknown> dtype=int32>, col56=<tf.Tensor 'args_51:0' shape=<unknown> dtype=int32>, col57=<tf.Tensor 'args_52:0' shape=<unknown> dtype=int32>, col58=<tf.Tensor 'args_53:0' shape=<unknown> dtype=int32>, col59=<tf.Tensor 'args_54:0' shape=<unknown> dtype=int32>, col6=<tf.Tensor 'args_55:0' shape=<unknown> dtype=int32>, col60=<tf.Tensor 'args_56:0' shape=<unknown> dtype=int32>, col61=<tf.Tensor 'args_57:0' shape=<unknown> dtype=int32>, col62=<tf.Tensor 'args_58:0' shape=<unknown> dtype=int32>, col63=<tf.Tensor 'args_59:0' shape=<unknown> dtype=int32>, col64=<tf.Tensor 'args_60:0' shape=<unknown> dtype=int32>, col65=<tf.Tensor 'args_61:0' shape=<unknown> dtype=int32>, col66=<tf.Tensor 'args_62:0' shape=<unknown> dtype=int32>, col67=<tf.Tensor 'args_63:0' shape=<unknown> dtype=int32>, col68=<tf.Tensor 'args_64:0' shape=<unknown> dtype=int32>, col69=<tf.Tensor 'args_65:0' shape=<unknown> dtype=int32>, col7=<tf.Tensor 'args_66:0' shape=<unknown> dtype=int32>, col70=<tf.Tensor 
'args_67:0' shape=<unknown> dtype=int32>, col71=<tf.Tensor 'args_68:0' shape=<unknown> dtype=int32>, col72=<tf.Tensor 'args_69:0' shape=<unknown> dtype=int32>, col73=<tf.Tensor 'args_70:0' shape=<unknown> dtype=int32>, col74=<tf.Tensor 'args_71:0' shape=<unknown> dtype=int32>, col75=<tf.Tensor 'args_72:0' shape=<unknown> dtype=int32>, col76=<tf.Tensor 'args_73:0' shape=<unknown> dtype=int32>, col77=<tf.Tensor 'args_74:0' shape=<unknown> dtype=int32>, col78=<tf.Tensor 'args_75:0' shape=<unknown> dtype=int32>, col79=<tf.Tensor 'args_76:0' shape=<unknown> dtype=int32>, col8=<tf.Tensor 'args_77:0' shape=<unknown> dtype=int32>, col80=<tf.Tensor 'args_78:0' shape=<unknown> dtype=int32>, col81=<tf.Tensor 'args_79:0' shape=<unknown> dtype=int32>, col82=<tf.Tensor 'args_80:0' shape=<unknown> dtype=int32>, col83=<tf.Tensor 'args_81:0' shape=<unknown> dtype=int32>, col84=<tf.Tensor 'args_82:0' shape=<unknown> dtype=int32>, col85=<tf.Tensor 'args_83:0' shape=<unknown> dtype=int32>, col86=<tf.Tensor 'args_84:0' shape=<unknown> dtype=int32>, col87=<tf.Tensor 'args_85:0' shape=<unknown> dtype=int32>, col88=<tf.Tensor 'args_86:0' shape=<unknown> dtype=int32>, col89=<tf.Tensor 'args_87:0' shape=<unknown> dtype=int32>, col9=<tf.Tensor 'args_88:0' shape=<unknown> dtype=int32>, col90=<tf.Tensor 'args_89:0' shape=<unknown> dtype=int32>, col91=<tf.Tensor 'args_90:0' shape=<unknown> dtype=int32>, col92=<tf.Tensor 'args_91:0' shape=<unknown> dtype=int32>, col93=<tf.Tensor 'args_92:0' shape=<unknown> dtype=int32>, col94=<tf.Tensor 'args_93:0' shape=<unknown> dtype=int32>, col95=<tf.Tensor 'args_94:0' shape=<unknown> dtype=int32>, col96=<tf.Tensor 'args_95:0' shape=<unknown> dtype=int32>, col97=<tf.Tensor 'args_96:0' shape=<unknown> dtype=int32>, col98=<tf.Tensor 'args_97:0' shape=<unknown> dtype=int32>, col99=<tf.Tensor 'args_98:0' shape=<unknown> dtype=int32>, target=<tf.Tensor 'args_99:0' shape=<unknown> dtype=int32>),)
    kwargs: {}

2020-03-17 12:40:17,329 [INFO] Not whitelisted: <method-wrapper '__call__' of function object at 0x14663b7a0>: default rule
2020-03-17 12:40:17,329 [INFO] Not whitelisted: <function make_petastorm_dataset.<locals>.<lambda> at 0x14663b7a0>: default rule
2020-03-17 12:40:17,330 [INFO] Entity <function make_petastorm_dataset.<locals>.<lambda> at 0x14663b7a0> is not cached for key <code object <lambda> at 0x1463d8660, file "/usr/local/lib/python3.7/site-packages/petastorm/tf_utils.py", line 399> subkey (<tensorflow.python.autograph.core.converter.ConversionOptions object at 0x1e731add0>, frozenset({'reader'}))
2020-03-17 12:40:17,330 [INFO] Converting <function make_petastorm_dataset.<locals>.<lambda> at 0x14663b7a0>
2020-03-17 12:40:17,338 [INFO] Error transforming entity <function make_petastorm_dataset.<locals>.<lambda> at 0x14663b7a0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/parser.py", line 78, in parse_entity
    return parse_str(source, preamble_len=len(future_features)), source
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/parser.py", line 139, in parse_str
    module_node = gast.parse(src)
  File "/usr/local/lib/python3.7/site-packages/gast/gast.py", line 240, in parse
    return ast_to_gast(_ast.parse(*args, **kwargs))
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    .map(lambda row: _set_shape_to_named_tuple(reader.schema, row, reader.batched_output))
    ^
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 506, in converted_call
    converted_f = conversion.convert(target_entity, program_ctx)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 322, in convert
    free_nonglobal_var_names)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 240, in _convert_with_cache
    entity, program_ctx)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 469, in convert_entity_to_ast
    nodes, name, entity_info = convert_func_to_ast(o, program_ctx)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 630, in convert_func_to_ast
    node, source = parser.parse_entity(f, future_features=future_features)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/parser.py", line 118, in parse_entity
    return parse_str(source, preamble_len=len(future_features)), source
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/parser.py", line 145, in parse_str
    raise ValueError('expected exactly one node node, found {}'.format(nodes))
ValueError: expected exactly one node node, found []
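
The SyntaxError in the traceback above seems to arise because AutoGraph tries to parse the source of the lambda in isolation, and the text it recovers is the continuation line starting with `.map(`, which is not a valid statement by itself. A minimal sketch using only the standard `ast` module reproduces that parse failure. The helper `parses_standalone` and the `dataset` prefix are illustrative names, not part of petastorm or TensorFlow.

```python
import ast

# The source fragment shown in the traceback (a continuation line of a chained call).
FRAGMENT = ".map(lambda row: _set_shape_to_named_tuple(reader.schema, row, reader.batched_output))"


def parses_standalone(source: str) -> bool:
    """Return True if `source` can be parsed as Python code on its own."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# A line that starts with a dot cannot be parsed in isolation.
print(parses_standalone(FRAGMENT))              # False
# The same text parses once the object it is chained onto is present.
print(parses_standalone("dataset" + FRAGMENT))  # True
```

Since these messages are logged at INFO level, they look like AutoGraph conversion failures for that one lambda rather than the error that stops the read.
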
working-estimate commented 4 years ago

This error also appears in the log:

INFO:tensorflow:Error transforming entity <function _new_gt_255_compatible_namedtuple at 0x140994440>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 506, in converted_call
    converted_f = conversion.convert(target_entity, program_ctx)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 322, in convert
    free_nonglobal_var_names)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 240, in _convert_with_cache
    entity, program_ctx)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 469, in convert_entity_to_ast
    nodes, name, entity_info = convert_func_to_ast(o, program_ctx)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 669, in convert_func_to_ast
    node = node_to_graph(node, context)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/conversion.py", line 714, in node_to_graph
    node = converter.apply_(node, context, control_flow)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/core/converter.py", line 409, in apply_
    node = converter_module.transform(node, context)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/converters/control_flow.py", line 578, in transform
    node = ControlFlowTransformer(ctx).visit(node)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/core/converter.py", line 346, in visit
    return super(Base, self).visit(node)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/transformer.py", line 480, in visit
    result = super(Base, self).visit(node)
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 273, in visit
    return visitor(node)
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 328, in generic_visit
    value = self.visit(value)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/core/converter.py", line 346, in visit
    return super(Base, self).visit(node)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/transformer.py", line 480, in visit
    result = super(Base, self).visit(node)
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 273, in visit
    return visitor(node)
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 328, in generic_visit
    value = self.visit(value)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/core/converter.py", line 346, in visit
    return super(Base, self).visit(node)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/transformer.py", line 480, in visit
    result = super(Base, self).visit(node)
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 273, in visit
    return visitor(node)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/converters/control_flow.py", line 182, in visit_If
    body_scope, defined_in, node.body)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/converters/control_flow.py", line 121, in _determine_aliased_symbols
    block_live_in = set(anno.getanno(block[0], anno.Static.LIVE_VARS_IN))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/pyct/anno.py", line 107, in getanno
    return getattr(node, field_name)[key]
KeyError: LIVE_VARS_IN
selitvin commented 4 years ago

Can you please provide a code example that can be used to reproduce this issue?

working-estimate commented 4 years ago

I am simply using make_batch_reader and passing it a Parquet file. I solved my issue, although the cause did not seem to be related to the errors above. If the folder you are reading contains an empty Parquet file (perhaps left by a filter operation in Spark without a repartition afterwards), petastorm fails, saying it has reached the end of the data, even though other files in the folder do contain data. Repartitioning fixes the issue because it ensures no files are empty, but petastorm should be able to handle this regardless.
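
One way to check whether a folder is affected is to look for Parquet files that hold zero rows before handing the folder to petastorm. The sketch below assumes a local folder (not HDFS or S3) and that pyarrow is installed. `parquet_row_count` and `find_empty_parquet_files` are hypothetical helper names, not petastorm functions.

```python
from pathlib import Path
from typing import Callable, List


def parquet_row_count(path: Path) -> int:
    """Read the row count from a Parquet file footer (requires pyarrow)."""
    # Imported lazily so the helper below can be used with another counter.
    import pyarrow.parquet as pq

    return pq.ParquetFile(str(path)).metadata.num_rows


def find_empty_parquet_files(
    folder: str,
    row_count: Callable[[Path], int] = parquet_row_count,
) -> List[Path]:
    """Return the *.parquet files directly under `folder` that contain zero rows."""
    return [p for p in sorted(Path(folder).glob("*.parquet")) if row_count(p) == 0]
```

If the returned list is non-empty, rewriting the dataset from Spark with a repartition after the filter, as described above, should remove the empty files.
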

selitvin commented 4 years ago

Agreed. Thanks for the report. We'll need to fix this.