javaswift / joss

Java library for OpenStack Storage, aka Swift
http://javaswift.org

Handle empty containers #99

Closed: effi-ofer closed this issue 8 years ago

effi-ofer commented 8 years ago

We identified an issue in joss' handling of empty containers and would like to suggest a fix.

Executive summary of the issue:

Joss uses a persistent (keep-alive) connection to the object store. This can result in information cached by joss being out of date by the time it is used. Normally this is not an issue; however, when creating an object with multiple parts (such as a parquet object), the cached information is used after some of the objects that make up the parquet have already been created, resulting in incorrect behavior.

Detailed explanation:

When storing a parquet object, five objects are actually created. The first three objects are created, then a list-container request is issued, followed by the creation of the final two objects. If the container where we are saving the parquet was empty to begin with, the list container will return 0 objects, even though 3 objects are already stored in the container by the time the list container is issued. Consequently, we don't end up writing the additional objects that make up the parquet.

How to reproduce:

  1. Create an empty container in swift.
  2. Store a parquet object in the empty container.
  3. The store will complete successfully, but you'll find that only three of the five objects that make up the parquet will be stored in the object store.

You can use the following Python code in Spark to reproduce the problem:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import sys

sc = SparkContext()
sqlContext = SQLContext(sc)

if (len(sys.argv) != 2):
    print "ERROR: This program takes object name as input"
    sys.exit(0)

objectName = sys.argv[1]

myList = [[1,'a'],[2,'b'],[3,'c'],[4,'d'],[5,'e'],[6,'f']]
parallelList = sc.parallelize(myList).collect()
schema = StructType([StructField('column1', IntegerType(), False), StructField('column2', StringType(), False)])
df = sqlContext.createDataFrame(parallelList, schema)
df.printSchema()
df.show(10)
dfTarget = df.coalesce(1)
dfTarget.write.parquet("swift2d://vault.spark/" + objectName)
dfRead = sqlContext.read.parquet("swift2d://vault.spark/" + objectName)
dfRead.show()

print "Done!"

How to fix:

The logic in src/main/java/org/javaswift/joss/client/core/AbstractPaginationMap.java lists a container by first using the cached information to determine how many objects it can expect. We suggest changing this logic to read from the container until no more objects are returned.
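
To make the suggestion more concrete, here is a minimal sketch of the kind of loop we have in mind, written against the public Container API rather than the internals of AbstractPaginationMap. The class and helper names (ListUntilEmpty, listAll) are ours, and we assume the list(prefix, marker, pageSize) overload, which issues a fresh listing request on each call:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.javaswift.joss.model.Container;
import org.javaswift.joss.model.StoredObject;

// Illustration only: list a container by paging with a marker until the
// object store returns an empty page, instead of sizing the pagination
// up front from a cached object count.
public final class ListUntilEmpty {

    public static List<StoredObject> listAll(Container container, int pageSize) {
        List<StoredObject> all = new ArrayList<StoredObject>();
        String marker = "";  // empty marker = start from the beginning of the listing
        while (true) {
            // Each call issues a fresh listing request against the object store,
            // so the result reflects what is stored right now.
            Collection<StoredObject> page = container.list("", marker, pageSize);
            if (page.isEmpty()) {
                break;  // zero names returned: nothing left to read
            }
            for (StoredObject object : page) {
                all.add(object);
                marker = object.getName();  // cursor for the next request
            }
        }
        return all;
    }

    private ListUntilEmpty() {
    }
}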

effi-ofer commented 8 years ago

Here is a bit more info on the fix, in case I was too sparse in my original submission. When listing a container, we provide a "marker" from the last call (the cursor) and a "limit", which is the maximum number of object names that should be returned. I believe the default for the limit is 10,000 or 9,999, but it does not really matter. The list container returns a list that is bounded by the limit provided. If there are more objects in the container, we issue another list container. We keep doing this until the list container returns a list of zero items.
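
As a usage sketch of the listAll helper above (the credentials, auth URL, tenant, and container name below are placeholders), the loop terminates as described: with a limit of 10,000 and three objects in the container, the first request returns three names and the second, using the last name as the marker, returns zero, which ends the listing:

import org.javaswift.joss.client.factory.AccountConfig;
import org.javaswift.joss.client.factory.AccountFactory;
import org.javaswift.joss.model.Account;
import org.javaswift.joss.model.Container;
import org.javaswift.joss.model.StoredObject;

public class ListAllExample {

    public static void main(String[] args) {
        // Placeholder credentials and endpoint; adjust for your Swift deployment.
        AccountConfig config = new AccountConfig();
        config.setUsername("user");
        config.setPassword("secret");
        config.setAuthUrl("http://swift.example.com:5000/v2.0/tokens");
        config.setTenantName("tenant");

        Account account = new AccountFactory(config).createAccount();
        Container container = account.getContainer("vault");

        // Read pages of up to 10,000 names until an empty page comes back.
        for (StoredObject object : ListUntilEmpty.listAll(container, 10000)) {
            System.out.println(object.getName());
        }
    }
}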

effi-ofer commented 8 years ago

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the right to submit it under the Apache License 2.0; or

(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or

(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.

(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.

Signed-off-by: Effi Ofer