scality / spark

Apache License 2.0

AttributeError in script s3_fsck_p0.py #3

Open apcurrier opened 2 years ago

apcurrier commented 2 years ago

The s3_fsck_p0 script has been running at a customer site for about 48 hours. We have had three failures in that time, with error messages like the ones below.

21/11/09 14:32:54 INFO ContextCleaner: Cleaned accumulator 74

21/11/10 14:02:45 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 280, 10.60.190.21, executor 15): org.apache.spark.SparkException: Task failed while writing rows.
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/root/spark/scripts/./S3_FSCK/s3_fsck_p0.py", line 164, in <lambda>
  File "/root/spark/scripts/./S3_FSCK/s3_fsck_p0.py", line 131, in blob
  File "/root/spark/scripts/./S3_FSCK/s3_fsck_p0.py", line 121, in check_split

AttributeError: 'NoneType' object has no attribute 'zfill'
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
        ... 10 more

21/11/10 14:02:45 INFO TaskSetManager: Starting task 6.1 in stage 5.0 (TID 284, 10.60.190.21, executor 12, partition 6, PROCESS_LOCAL, 7771 bytes)
21/11/10 14:02:45 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.60.190.21:32857 (size: 63.6 KB, free: 4.9 GB)
21/11/10 14:02:45 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.60.190.21:45752
![spark_error](https://user-images.githubusercontent.com/5631642/141475657-d71f106f-be0f-45ce-8966-f95fe189e441.png)
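
For reference, the traceback ends in check_split() at s3_fsck_p0.py line 121 calling .zfill() on a value that is None, which is exactly the error Python raises when an expected field could not be parsed from a key. A minimal sketch of the failure mode (the helper and key layout below are illustrative, not the actual script code):

```python
# Minimal illustration of the failure mode (illustrative names only, not
# the actual s3_fsck_p0.py code): when an expected field cannot be parsed
# from a key, the parser returns None, and calling .zfill() on it raises
# the AttributeError seen in the Spark log.
def parse_field(key):
    # Hypothetical parse: take the text after the last space, if present.
    parts = key.rsplit(" ", 1)
    return parts[1] if len(parts) == 2 else None

good_key = "object-name 1234"
bad_key = "object-name"  # expected field missing

print(parse_field(good_key).zfill(20))  # "00000000000000001234"
print(parse_field(bad_key).zfill(20))   # AttributeError: 'NoneType' object has no attribute 'zfill'
```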

ghost commented 2 years ago

This issue appears to have been tracked down to versioned objects in the key listing.

Initially it seemed that the binary/escape sequences delimiting the version from the object name were the cause of the failure. However, after cleaning out the non-printable (binary, etc.) content, the error still occurred when the p0 script was run.

Later we tried wiping out the binary content, version, and whitespace entirely: `cat -v keys2.txt | sed 's/\^@.*RG001 .*\;/;/' > clean_keys2.txt`

This resolved the NoneType errors. I suspect the extra whitespace was the issue, since we were not using a quoted CSV format.
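
If the input cannot always be cleaned up front, a defensive guard around the .zfill() call would also keep a single malformed row from failing the whole Spark task. A sketch, assuming the field may legitimately be absent (names are illustrative, not the script's actual code):

```python
# Hedged sketch of a defensive guard (illustrative names): return a
# sentinel for rows whose parsed field is missing instead of letting
# None.zfill() raise and kill the task.
def safe_zfill(value, width=20):
    if value is None:
        return None  # caller can drop or log the offending key
    return value.zfill(width)
```

Filtering out or logging the None rows before writing keeps the job running while still surfacing which keys need cleanup.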