Closed johankit closed 1 year ago
A few things to check:
What are the contents of /mount_ws/02_distmult/?
The post-processor is only tested with pyarrow as the parquet backend, so this could be a fastparquet issue. Try using pyarrow instead.
You can read the node and relation embeddings directly with numpy or torch as a workaround. You would just need to apply the node/relation mapping to recover the original ids. Here is the postprocessor implementation for exporting node and relation embeddings.
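The manual workaround could look roughly like this. Note the file name, dtype, and mapping layout below are assumptions for illustration, not Marius's documented on-disk format:

```python
import numpy as np

# Hypothetical layout: a flat binary file of float32 embeddings, plus an
# integer array mapping internal ids back to the original node ids.
# File name and dtype are assumptions, not Marius's documented format.
num_nodes, dim = 4, 3
emb = np.arange(num_nodes * dim, dtype=np.float32).reshape(num_nodes, dim)
emb.tofile("node_embeddings.bin")

# Read the flat file back and reshape to (num_nodes, dim).
loaded = np.fromfile("node_embeddings.bin", dtype=np.float32).reshape(num_nodes, dim)

# Apply an (assumed) internal->original id mapping: internal row i holds
# the embedding of original node mapping[i].
mapping = np.array([2, 0, 3, 1])
original_order = np.empty_like(loaded)
original_order[mapping] = loaded
```

The same buffer can be wrapped with `torch.from_numpy(loaded)` if you need a tensor.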
To close this up: Using PyArrow solved the issue! I was now able to export the embeddings in parquet format.
As additional info: the containerized export process used 495 GB of memory and ran for about 40 min to export embeddings totaling 192 GB in .parquet format.
Thank you!
Describe the bug
When trying to export my generated embeddings using the marius_postprocess command, I receive an error that kills the export process.
The exact command I am using is:
marius_postprocess --model_dir /mount_ws/02_distmult --format parquet --output_dir /mount_ws/parquet_export/02_distmult_parquet
Which gives the following error after a while:
How can I interpret this error? Running with the .csv file format works, but it seems to produce very large files, and I presume .parquet is more efficient.
The packages installed are:
Environment
marius_env_info output:

I'd be glad for any help - thank you!