marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Error during parquet embedding export #128

Closed: johankit closed this issue 1 year ago

johankit commented 1 year ago

Describe the bug
When trying to export my generated embeddings with the marius_postprocess command, I receive an error that kills the export process.

The exact command I am using is:

marius_postprocess --model_dir /mount_ws/02_distmult --format parquet --output_dir /mount_ws/parquet_export/02_distmult_parquet

After a while, this fails with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/marius_postprocess", line 11, in <module>
    load_entry_point('marius==0.0.2', 'console_scripts', 'marius_postprocess')()
  File "/usr/local/lib/python3.6/dist-packages/marius/tools/marius_postprocess.py", line 61, in main
    exporter.export(output_dir)
  File "/usr/local/lib/python3.6/dist-packages/marius/tools/postprocess/in_memory_exporter.py", line 176, in export
    self.export_node_embeddings(output_dir)
  File "/usr/local/lib/python3.6/dist-packages/marius/tools/postprocess/in_memory_exporter.py", line 83, in export_node_embeddings
    self.overwrite,
  File "/usr/local/lib/python3.6/dist-packages/marius/tools/postprocess/in_memory_exporter.py", line 37, in save_df
    output_df.to_parquet(output_path)
  File "/usr/local/lib/python3.6/dist-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 2372, in to_parquet
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parquet.py", line 276, in to_parquet
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parquet.py", line 199, in write
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 951, in write
    partition_cols=partition_on)
  File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 750, in make_metadata
    object_encoding=oencoding, times=times)
  File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 116, in find_type
    object_encoding = infer_object_encoding(data)
  File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 322, in infer_object_encoding
    for i in head if i):
  File "/usr/local/lib/python3.6/dist-packages/fastparquet/writer.py", line 322, in <genexpr>
    for i in head if i):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

How should I interpret this error? Exporting in .csv format works, but it produces very large files; I presume .parquet would be more space-efficient.

The packages installed are:

antlr4-python3-runtime (4.9.3)
asn1crypto (0.24.0)
cramjam (2.3.2)
cryptography (2.1.4)
dataclasses (0.8)
fastparquet (0.7.2)
fsspec (2022.1.0)
GPUtil (1.4.0)
idna (2.6)
importlib-metadata (4.8.3)
keyring (10.6.0)
keyrings.alt (3.0)
marius (0.0.2)
numpy (1.19.5)
omegaconf (2.2.3)
pandas (1.1.5)
pip (9.0.1)
psutil (5.9.2)
py4j (0.10.9.5)
pycrypto (2.6.1)
pygobject (3.26.1)
pyspark (3.2.2)
python-apt (1.6.5+ubuntu0.7)
python-dateutil (2.8.2)
pytz (2022.4)
pyxdg (0.25)
PyYAML (6.0)
SecretStorage (2.3.1)
setuptools (39.0.1)
six (1.11.0)
thrift (0.16.0)
torch (1.9.1+cu111)
typing-extensions (4.1.1)
unattended-upgrades (0.1)
wheel (0.30.0)
zipp (3.6.0)

Environment
marius_env_info output:

cmake:
  version: 3.20.0
cpu_info:
  num_cpus: 96
  total_memory: 377GB
cuda:
  version: '11.1'
gpu_info:
  - memory: 40GB
    name: NVIDIA A100-PCIE-40GB
marius:
  bindings_installed: true
  install_path: /usr/local/lib/python3.6/dist-packages/marius
  version: 0.0.2
openmp:
  version: '201511'
operating_system:
  platform: Linux-4.18.0-305.65.1.el8_4.x86_64-x86_64-with-Ubuntu-18.04-bionic
pybind:
  PYBIND11_BUILD_ABI: _cxxabi1011
  PYBIND11_COMPILER_TYPE: _gcc
  PYBIND11_STDLIB: _libstdcpp
python:
  compiler: GCC 8.4.0
  deps:
    numpy_version: 1.19.5
    omegaconf_version: 2.2.3
    pandas_version: 1.1.5
    pip_version: 9.0.1
    pyspark_version: 3.2.2
    pytest_version: 7.0.1
    torch_version: 1.9.1+cu111
    tox_version: 3.28.0
  version: 3.6.9
pytorch:
  install_path: /usr/local/lib/python3.6/dist-packages/torch
  version: 1.9.1+cu111

I'd be grateful for any help. Thank you!

JasonMoho commented 1 year ago

A few things to check:

  1. What are the contents of /mount_ws/02_distmult/?

  2. The post-processor is only tested with pyarrow as the parquet backend, so this could be a fastparquet issue. Try using pyarrow instead (see the first sketch below).

  3. As a workaround, you can read the node and relation embeddings directly with numpy and torch; you would just need to apply the node/relation mapping to recover the original ids (see the second sketch below). Here is the postprocessor implementation for exporting the node and the relation embeddings.
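A minimal sketch of suggestion 2. The DataFrame layout here (per-node embeddings stored as numpy arrays in an object column) is an assumption about what the exporter builds, but it reproduces the fastparquet failure mode from the traceback. Note that pandas' default engine="auto" prefers pyarrow when it is installed, so installing pyarrow may already be enough; the snippet also shows requesting the engine explicitly:

# pip install pyarrow
import numpy as np
import pandas as pd

# Assumed layout: one row per node, with the embedding held as a numpy
# array inside an object column. fastparquet's object-encoding inference
# applies a truthiness test to each cell, which raises the
# ambiguous-truth-value ValueError seen in the traceback above.
df = pd.DataFrame({
    "id": [0, 1],
    "embedding": [np.zeros(3, dtype=np.float32)] * 2,
})

# Force pyarrow rather than letting pandas fall back to fastparquet;
# pyarrow writes the object column as a list<float> parquet column.
df.to_parquet("embeddings.parquet", engine="pyarrow")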
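And a minimal sketch of suggestion 3, reading the raw embeddings directly with numpy. The file names, the float32 dtype, and the mapping-file layout below are hypothetical placeholders for illustration; inspect the contents of your model_dir for the actual artifact names and use your configured embedding dimension:

import numpy as np
import pandas as pd

MODEL_DIR = "/mount_ws/02_distmult"
EMB_DIM = 50  # replace with the embedding dimension from your config

# Hypothetical file names -- check MODEL_DIR for the real artifacts.
emb = np.fromfile(f"{MODEL_DIR}/model/embeddings.bin", dtype=np.float32)
emb = emb.reshape(-1, EMB_DIM)  # one row per internal node id

# Assumed mapping layout: two columns, original id -> internal id.
mapping = pd.read_csv(
    f"{MODEL_DIR}/nodes/node_mapping.txt",
    header=None,
    names=["original_id", "internal_id"],
)

# Sort by internal id so row i of `emb` lines up with the original id
# it was remapped from, then attach the original ids as a key column.
original_ids = mapping.sort_values("internal_id")["original_id"].to_numpy()
out = pd.DataFrame(emb)
out.insert(0, "id", original_ids)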

johankit commented 1 year ago

To close this out: using pyarrow solved the issue! I was then able to export the embeddings in parquet format.

As additional information: the containerized export process required 495 GB of memory and about 40 minutes of runtime to export the embeddings, which total 192 GB in .parquet format.

Thank you!