lensacom / sparkit-learn

PySpark + Scikit-learn = Sparkit-learn
Apache License 2.0

Error in Creating DictRdd: Can only zip RDDs with same number of elements in each partition #46

Closed mrshanth closed 9 years ago

mrshanth commented 9 years ago

I am trying to create a DictRdd as follows:

cleanedRdd=sc.sequenceFile(path="hdfs:///bdpilot/text_mining/sequence_input_with_target_v",minSplits=100)
train_rdd,test_rdd = cleanedRdd.randomSplit([0.7,0.3])
train_rdd.saveAsSequenceFile("hdfs:///bdpilot/text_mining/sequence_train_input")

train_rdd = sc.sequenceFile(path="hdfs:///bdpilot3_h/text_mining/sequence_train_input",minSplits=100)

train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]))
train_text = train_rdd.map(lambda(x,y): y.split("~")[0])

train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)

But I get the following error when I call:

train_Z.first()
org.apache.spark.SparkException: Can only zip RDDs with same number of
elements in each partition

I tried the following as well, but with no success (note: the keyword argument is preservesPartitioning):

train_y = train_rdd.map(lambda(x,y): int(y.split("~")[1]), preservesPartitioning=True)
train_text = train_rdd.map(lambda(x,y): y.split("~")[0], preservesPartitioning=True)
train_Z = DictRDD((train_text,train_y),columns=('X','y'),bsize=50)
kszucs commented 9 years ago

It looks like Spark cannot zip the two RDDs. Did you try train_text.zip(train_y) without creating a DictRDD? This is exactly what splearn does under the hood, so you will probably get the same exception.
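For context, Spark's RDD.zip pairs elements partition by partition, so both RDDs must have the same number of partitions and the same number of elements in each corresponding partition. A minimal pure-Python sketch of that rule (no Spark required; zip_partitioned is a hypothetical helper written for illustration, not a Spark API):

```python
# Sketch of Spark's zip rule: each RDD is modeled as a list of
# partitions, where each partition is a list of elements.

def zip_partitioned(a, b):
    """Zip two 'RDDs' partition by partition, mimicking RDD.zip."""
    if len(a) != len(b):
        raise ValueError("Can only zip RDDs with the same number of partitions")
    out = []
    for pa, pb in zip(a, b):
        if len(pa) != len(pb):
            # This is the condition behind the SparkException in the issue.
            raise ValueError(
                "Can only zip RDDs with same number of elements in each partition")
        out.append(list(zip(pa, pb)))
    return out

# Aligned partitions zip fine:
ok = zip_partitioned([[1, 2], [3, 4, 5]], [["a", "b"], ["c", "d", "e"]])

# Misaligned partitions (same total count, different per-partition counts) fail:
try:
    zip_partitioned([[1], [2, 3]], [[1, 2], [3]])
except ValueError as e:
    print(e)  # Can only zip RDDs with same number of elements in each partition
```

This is why two RDDs derived from the same parent can still fail to zip: if the two map pipelines end up with different partition layouts, the per-partition counts no longer line up even though the totals match.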

Also, DictRDD accepts an RDD of tuples, so please try the following:

train_text_y = train_rdd.map(
    lambda (x, y): (y.split("~")[0], int(y.split("~")[1])))

Z_train = DictRDD(train_text_y, columns=('X', 'y'), bsize=50)
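The key idea above is to split each record once and emit an (X, y) tuple, so no zip is needed at all. A standalone sketch of that parsing step, using a named function instead of the Python 2 tuple-unpacking lambda (which was removed in Python 3); the "text~label" record format is taken from the original post, and parse_record is a hypothetical helper name:

```python
def parse_record(kv):
    """Turn one sequence-file (key, value) pair into an (X, y) tuple.

    The value is assumed to look like "some text~1", i.e. the free text
    and an integer label joined by "~", as in the original post.
    """
    _key, value = kv
    parts = value.split("~")
    return (parts[0], int(parts[1]))

# A plain-Python check of the parsing logic:
print(parse_record(("doc42", "hello world~1")))  # ('hello world', 1)
```

With Spark available, the same function would be applied as train_text_y = train_rdd.map(parse_record) and then passed to DictRDD as in the snippet above.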
mrshanth commented 9 years ago

Thanks. We did exactly the same thing after posting, and it worked.