Closed: filipski closed this issue 4 years ago
For some reason the _common_metadata file in my data set has been corrupted. Instead of proper parquet content, it has just 4 bytes (zipped here only due to GitHub's limitation on attached file types): _common_metadata.zip

Every time I try to append to the set now, I get the following exception:

pyarrow.lib.ArrowIOError: Invalid Parquet file size is 4 bytes, smaller than standard file footer (8 bytes)

Any idea why my metadata got corrupted? Have you experienced the same? How can I prevent it? And any ideas, at least, how to fix that broken metadata for the existing data set?

The code:
Was not able to reproduce using your example:
$ ll /tmp/petastorm_ingest_test/_common_metadata
-rw-r--r-- 1 yevgeni uberatc 6774 Mar 26 22:45 /tmp/petastorm_ingest_test/_common_metadata
I also tried using our CI docker image to run your example (running from the petastorm workspace directory):
docker run -it -e PYTHONPATH=/petastorm -v `pwd`:/repro -v `pwd`:/petastorm selitvin/petastorm_ci_auto:ci-2020-31-01-09 bash -c "source /petastorm_venv3.7/bin/activate && python3 /repro/repro_515.py "
(tweaked repro_515.py to fake images: https://gist.github.com/selitvin/7150398b11db9f26c4f03cb9dbf9e679)
Thanks for trying. It's HDFS on the following setup: petastorm==0.8.2, pyspark=2.4.4=py_0
I guess it will be difficult to reproduce; it happened once in a couple of hundred ingestions I've done so far. What you can reproduce is the exception on subsequent attempts to append to the set, if you replace your _common_metadata file with the one I attached.
It might be related to a restart of a name+data node of the cluster a few minutes after the ingestion, but that was well after the script finished, so all the data should already have been on HDFS (just maybe not yet replicated to other nodes; I have the replication factor set to 3).
Is there an easy way to re-create a proper _common_metadata file for an existing data set? I presume it would require scanning through all the files, but that's better than re-creating the whole set from scratch...
Oh, I see. Did not realize this was a one-off thing. A small trick you can use to reconstruct _common_metadata is to get materialize_dataset to regenerate it, like this:
with materialize_dataset(spark, output_url, ImageSchema, ROWGROUP_SIZE_MB,
                         filesystem_factory=resolver.filesystem_factory()):
    pass
Hope that works for you...
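In case it helps, here is a fuller, self-contained sketch of the same trick (ImageSchema, ROWGROUP_SIZE_MB and output_url below are placeholders; use the schema and settings your dataset was originally written with):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType
import numpy as np

from petastorm.codecs import ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.fs_utils import FilesystemResolver
from petastorm.unischema import Unischema, UnischemaField

# Placeholder schema: this must be the same Unischema your dataset
# was originally written with, not this toy one.
ImageSchema = Unischema('ImageSchema', [
    UnischemaField('id', np.int64, (), ScalarCodec(LongType()), False),
])

ROWGROUP_SIZE_MB = 256  # placeholder: use the value from your original ingestion
output_url = 'hdfs:///data/petastorm_ingestion_tests'  # placeholder dataset URL

spark = SparkSession.builder.appName('regen_common_metadata').getOrCreate()
resolver = FilesystemResolver(output_url,
                              spark.sparkContext._jsc.hadoopConfiguration())

# The empty block writes no rows; on exit, materialize_dataset rescans the
# existing parquet files and rewrites _common_metadata for the dataset.
with materialize_dataset(spark, output_url, ImageSchema, ROWGROUP_SIZE_MB,
                         filesystem_factory=resolver.filesystem_factory()):
    pass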
Thanks. It indeed works, but with some quirks. The _common_metadata file is generated; however, if I run the code multiple times (without any changes to the underlying data set given in output_url), the size of that file grows. I tried using parquet-tools to inspect the content of the file after each execution:

hadoop jar /usr/local/tools/parquet-tools-1.10.1.jar meta /data/petastorm_ingestion_tests/_common_metadata

and it seems the part which grows is this one:
extra: ARROW:schema =
The other fields look stable and seem to contain correct content, namely:
extra: dataset-toolkit.unischema.v1 =
extra: dataset-toolkit.num_row_groups_per_file.v1 =
extra: org.apache.spark.sql.parquet.row.metadata =
I compared the content of dataset-toolkit.num_row_groups_per_file.v1 with the content of the folders in my data set and all unique parquet files are listed there, so this looks fine. org.apache.spark.sql.parquet.row.metadata also matches my Unischema.
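For reference, the same key/value metadata can be dumped with pyarrow too; this is roughly what I used to see which entry grows (on a local copy of the file):

import pyarrow.parquet as pq

meta = pq.read_metadata('/tmp/_common_metadata')  # local copy of the file
for key, value in meta.metadata.items():          # footer key/value metadata
    print(key.decode(), len(value))               # entry name and payload size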
I did some debugging and it looks to me that the content of the ARROW:schema changes in this line:
https://github.com/uber/petastorm/blob/b425e435a5004d56d2618021d9e12fb88b939810/petastorm/utils.py#L117
but I'm not sure what exactly to_arrow_schema() does.
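From a quick experiment (paths from my setup), to_arrow_schema() seems to convert the low-level parquet schema of the dataset into a pyarrow.Schema, carrying the footer key/value metadata along with it:

import pyarrow.parquet as pq

dataset = pq.ParquetDataset('/data/petastorm_ingestion_tests')
arrow_schema = dataset.schema.to_arrow_schema()  # ParquetSchema -> pyarrow.Schema
print(arrow_schema.metadata.keys())  # if b'ARROW:schema' is carried along here,
                                     # it gets re-embedded on the next write,
                                     # which could explain the growth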
So, the only thing which worries me a bit now is why the ARROW:schema entry grows every time I run your empty materialize_dataset section on an unchanged data set. Does it have any negative impact?
I am not sure if there is a negative impact. Would hope not, as long as you are able to read all fields from the dataset. Perhaps deleting _common_metadata again and restoring it with the above method would get rid of whatever extra weight was added to the ARROW:schema? (Sorry, I do not have an in-depth understanding of how pyarrow uses _common_metadata.)
Thanks, let's close it for now. Final questions: do you experience this growth when appending data to your data sets? Does it affect the performance of your writes or reads?
Personally, I have never implemented a scenario of appending data to datasets in our org's setup; all datasets were immutable. I assume that all extensions would have to be done under materialize_dataset, since Petastorm keeps a list of all row-groups in the metadata and it has to include newly added parquet files. Off the top of my head, I don't see a reason for performance degradation, but I cannot say for sure without trying/measuring.
By appending I meant adding new parquet files to the partitioned structure, as in the code above; the parquet files themselves are indeed immutable. Thanks for your help. I'll monitor this to see if it affects performance as my data set grows bigger.
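For the record, the append itself looks roughly like this (simplified; it reuses spark, output_url, ImageSchema, ROWGROUP_SIZE_MB and resolver from the regeneration snippet above, and new_rows is a list of dicts matching the schema):

from petastorm.unischema import dict_to_spark_row

rows_rdd = spark.sparkContext.parallelize(new_rows) \
    .map(lambda row: dict_to_spark_row(ImageSchema, row))

# On exit, materialize_dataset picks up the newly appended parquet files and
# rewrites the row-group list in _common_metadata to include them.
with materialize_dataset(spark, output_url, ImageSchema, ROWGROUP_SIZE_MB,
                         filesystem_factory=resolver.filesystem_factory()):
    spark.createDataFrame(rows_rdd, ImageSchema.as_spark_schema()) \
        .write.mode('append').parquet(output_url)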