moshe / elasticsearch_loader

A tool for batch loading data files (json, parquet, csv, tsv) into ElasticSearch
MIT License

Unable to load parquet files into local Elasticsearch #79

Closed turboT4 closed 4 years ago

turboT4 commented 4 years ago

So I've just installed elasticsearch-loader following all the steps in the docs, along with the elasticsearch-loader[parquet] extra.

However, whenever I try to load a parquet file into my local Elasticsearch, I get the following traceback:

Traceback (most recent call last):
  File "/usr/local/bin/elasticsearch_loader", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch_loader/__init__.py", line 159, in _parquet
    load(lines, ctx.obj)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch_loader/__init__.py", line 51, in load
    for i, bulk in enumerate(pbar):
  File "/usr/local/lib/python2.7/site-packages/click/_termui_impl.py", line 285, in generator
    for rv in self.iter:
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 420, in DictReader
    for row in reader(file_obj, columns):
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 438, in reader
    schema_helper = schema.SchemaHelper(footer.schema)
  File "/usr/local/lib/python2.7/site-packages/parquet/schema.py", line 24, in __init__
    assert len(self.schema_elements) == len(self.schema_elements_by_name)
AssertionError

Any clue on why this could happen?

moshe commented 4 years ago

It seems like an issue in https://github.com/jcrobak/parquet-python. Can you paste the parquet schema here?

turboT4 commented 4 years ago

Just read that Array types aren't supported... Is this https://github.com/jcrobak/parquet-python/issues/32 the issue you mentioned?

If so, yeah, I do have arrays and inner arrays within my schema, nested in a Map. Maps, everything! It's quite complete.
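
A minimal sketch of why a schema with several arrays could trip the assertion in the traceback. It assumes that parquet-python's SchemaHelper indexes the flattened footer schema by bare element name, which is what the name schema_elements_by_name suggests but is not confirmed in this thread. SchemaElement, duplicate_names and the column names below are hypothetical stand-ins, not the library's API. In the standard Parquet list layout every array column repeats the same child names ("list", "element"), so a by-name dict ends up shorter than the element list.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a parquet schema element; only the name matters here.
SchemaElement = namedtuple("SchemaElement", ["name"])


def duplicate_names(schema_elements):
    """Return the element names that occur more than once in a flattened schema."""
    counts = Counter(se.name for se in schema_elements)
    return sorted(name for name, n in counts.items() if n > 1)


# Flattened schema for two array columns ("tags" and "scores" are made up).
# Each list contributes the same child names: "list" and "element".
flat_schema = [
    SchemaElement(name)
    for name in ["schema", "tags", "list", "element", "scores", "list", "element"]
]

# Assumption: the helper builds a dict keyed by name, so duplicates collapse.
by_name = {se.name: se for se in flat_schema}

print(len(flat_schema), len(by_name))  # 7 5, so an equality assert would fail
print(duplicate_names(flat_schema))    # ['element', 'list']
```

With a single flat column per name the two lengths match, which would explain why simple parquet files load fine while nested ones do not.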

moshe commented 4 years ago

So I'm afraid you will not be able to load these parquet files with esl. You can try forking the repo, replacing parquet-python with fastparquet[1], and submitting a PR if it works 🙂

[1] https://fastparquet.readthedocs.io/en/latest/
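
A rough sketch of what the swap suggested above might look like: a generator that yields one dict per row, in place of the DictReader call seen in the traceback. The function name parquet_rows is hypothetical and not part of elasticsearch_loader. It assumes fastparquet and pandas are installed, and whether fastparquet can actually read deeply nested arrays and maps is untested here.

```python
def parquet_rows(path, columns=None):
    """Yield each row of a parquet file as a dict, read via fastparquet.

    Hypothetical helper: a sketch of a replacement row source, not esl code.
    """
    # Imported lazily so this module loads even when fastparquet is missing.
    from fastparquet import ParquetFile

    # to_pandas reads the whole file (or only the selected columns) into a DataFrame.
    frame = ParquetFile(path).to_pandas(columns=columns)

    # orient="records" gives a list of {column: value} dicts, one per row.
    for record in frame.to_dict(orient="records"):
        yield record
```

Usage would be along the lines of `for row in parquet_rows("data.parquet"): ...`, where "data.parquet" is a placeholder path. This reads the whole file into memory at once, so large files may need to be read one row group at a time instead.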

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity (14 days). It will be closed if no further activity occurs in the next 7 days. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had recent activity (21 days). Please reopen it if you feel that the issue is not resolved yet.