twitter / elephant-bird

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
Apache License 2.0

SequenceFile with VectorWritable #389

Open drahcos opened 10 years ago

drahcos commented 10 years ago

Hi, I am trying to extract entries from a tfidf-SequenceFile which I created with seq2sparse. I can read and extract the content, but I need to create a new SequenceFile with the entries I extracted. The value needs to be of type VectorWritable (as in the seq2sparse tfidf output). I tried your SequenceFileStorage with '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter' as the second parameter, but the output always uses the Text class instead. Is there some way to handle this?

Regards, Richard

rangadi commented 10 years ago

Can you post the actual Pig statement?

drahcos commented 10 years ago

STORE newFiles INTO 'new-Vectors' USING SequenceFileStorage('-c com.twitter.elephantbird.pig.util.TextConverter', '-c com.twitter.elephantbird.pig.util.VectorWritableConverter');

drahcos commented 10 years ago

While using the SequenceFileLoader (I didn't use it before because I can read the entries with seqdumper), I stored some entries with PigStorage to get a sample. I noticed that the loader doesn't get the values, only the keys. It definitely stores two things, since I can see a tab right after the key, but the values are empty. And yes, I noticed that I had the wrong path for VectorWritableConverter, but after changing it the problem remains.

sagemintblue commented 10 years ago

Could you also give us the schema of your newFiles relation?

DESCRIBE newFiles;


drahcos commented 10 years ago

DESCRIBE newFiles;
newFiles: {key: chararray,dbFiles::value: chararray}

This is without the SequenceFileLoader.

Btw, I have now changed SequenceFileStorage to always choose VectorWritable (job.setOutputValueClass(VectorWritable.class);), which works, but I need to load it as VectorWritable and I didn't find a way to do this. Could you tell me where the SequenceFileLoader gets this information? I mean the place where I can directly enter "VectorWritable.class", like I did in SequenceFileStorage, so I can force it.

drahcos commented 10 years ago

With the SequenceFileLoader:

newFiles: {key: chararray,dbFiles::value: chararray}

dbFiles is what I load. newFiles is the result of a join with some keys.

drahcos commented 10 years ago

Isn't there anything I can do? I really just need to load a tfidf-sequencefile, compare the keys and store some of the entries into a new tfidf-sequencefile. I don't need to manipulate the vectors or anything. Please tell me if there is a way to hardcode it or something else I can do. I'm in real need of this data.

sagemintblue commented 10 years ago

If you don't need to do anything in pig with the vector data, please try out GenericWritableConverter:

https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/util/GenericWritableConverter.java


drahcos commented 10 years ago

I'm sorry but that also didn't work. It seems like SequenceFileLoader doesn't accept my input since I only get the standard Text class. Is this input correct?

dbFiles = LOAD 'ready-Vectors/tfidf-vectors' USING com.twitter.elephantbird.pig.load.SequenceFileLoader('-c com.twitter.elephantbird.pig.util.TextConverter', '-c com.twitter.elephantbird.pig.util.GenericWritableConverter') AS (key: chararray, value);

sagemintblue commented 10 years ago

You may be missing REGISTER statements. All supporting jars must be included in the job; otherwise you'll run into class-not-found errors at runtime.


drahcos commented 10 years ago

I registered them all:

REGISTER $ELEPH_LIBS/elephant-bird-core-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-cascading2-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-lucene-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-hive-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-hadoop-compat-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-rcfile-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-mahout-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-examples-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-pig-lucene-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-crunch-4.4.jar

even the ones I don't need. Do you know where in the code SequenceFileLoader sets the classes to load? I already hardcoded VectorWritable for the store function and I know it worked because when I used the hardcoded version I get:

Backend error message

java.io.IOException: java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:470)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:405)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Su

Pig Stack Trace

ERROR 2997: Encountered IOException. java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable

java.io.IOException: java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:470)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:405)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)

When I hardcode it to the Text class I get a perfect sequence file but of course with Text for the value.
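The "wrong value class" message is what one would expect when a writer is configured for one value class but is handed an instance of another. The following is a minimal stand-alone model of that kind of check, using only the JDK. It is a sketch, not Hadoop's or elephant-bird's code: `TypedWriter` is a hypothetical name, and `Integer` and `String` stand in for VectorWritable and Text.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a writer that is bound to one value class.
final class TypedWriter {
    private final Class<?> valueClass;              // class declared when the writer is created
    private final List<Object> written = new ArrayList<>();

    TypedWriter(Class<?> valueClass) {
        this.valueClass = valueClass;
    }

    // Accepts a value only if its runtime class is exactly the declared class.
    void append(Object value) throws IOException {
        if (value.getClass() != valueClass) {
            throw new IOException("wrong value class: "
                    + value.getClass().getName() + " is not " + valueClass);
        }
        written.add(value);
    }

    int size() {
        return written.size();
    }
}
```

In this model, declaring the writer for one class while the upstream converter still produces another fails on the first record, which matches the symptom described: forcing the output value class changes what the writer expects, not what the converter hands to it.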

sagemintblue commented 10 years ago

I think I'm nearing the limit on my ability to help you with this, but let me go over my assumptions here once more:

You have an existing sequence file dataset, whose keys are Text and values VectorWritable.

You'd like to load this data into pig, filter the (key, value) pairs based on keys, then write the remaining (key, value) pairs back out into another sequence file.

You won't touch the values at all, but need to write them through to output.

If these assumptions are correct, you should be able to do this with something resembling the following:

REGISTER '$ELEPHLIBS/elephant-bird-core-.jar';
REGISTER '$ELEPHLIBS/elephant-bird-pig-.jar';
REGISTER 'path/to/mahout-math.jar';

%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare GENERIC_CONVERTER 'com.twitter.elephantbird.pig.util.GenericWritableConverter';
%declare VECTOR_WRITABLE 'org.apache.mahout.math.VectorWritable';

-- load existing data, resulting schema is (key: chararray, value: bytearray)
entry = LOAD 'seqfile_data' USING $SEQFILE_LOADER(
  '-c $TEXT_CONVERTER', '-c $GENERIC_CONVERTER'
);

-- filter entries
entry_filtered = FILTER entry BY key == 'something';

-- store remaining entries into new sequence file
STORE entry_filtered INTO 'seqfile_data_filtered' USING $SEQFILE_STORAGE(
  '-c $TEXT_CONVERTER', '-c $GENERIC_CONVERTER -t $VECTOR_WRITABLE'
);
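The idea behind the script above is that the values travel through Pig as opaque bytes (the `bytearray` in the schema comment) and only the keys are inspected. The sketch below models that idea with plain JDK streams. It is a conceptual illustration under stated assumptions, not elephant-bird's implementation: `OpaquePassThrough`, `encode`, `decode` and `filterByKey` are hypothetical names, and a `double[]` stands in for a Mahout vector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model: values are serialized once, carried as raw bytes, and
// decoded only by a consumer that knows their type.
final class OpaquePassThrough {

    // Serializes a vector as a length followed by its elements.
    static byte[] encode(double[] vector) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(vector.length);
            for (double v : vector) {
                out.writeDouble(v);
            }
        }
        return buffer.toByteArray();
    }

    // Reverses encode(); the caller has to know the bytes hold a vector.
    static double[] decode(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            double[] vector = new double[in.readInt()];
            for (int i = 0; i < vector.length; i++) {
                vector[i] = in.readDouble();
            }
            return vector;
        }
    }

    // Keeps entries whose key is wanted; the value bytes are never inspected.
    static Map<String, byte[]> filterByKey(Map<String, byte[]> entries, Set<String> wanted) {
        Map<String, byte[]> kept = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

Because the filter step never decodes the values, the type only has to be named where the bytes are turned back into objects, which is the role the `-t $VECTOR_WRITABLE` argument appears to play on the STORE side of the script.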


drahcos commented 10 years ago

Oh my god! You don't know how happy I am right now xD. Everything works perfectly! I need this data for my thesis, and I spent so much time on things that were actually not part of my work because so much stuff went wrong. Honestly! Thank you!

drahcos commented 10 years ago

Oh! Btw, it needs the mahout-core.jar. Thank you again! :D

rangadi commented 10 years ago

Glad that things finally worked out. Good luck with your thesis. Thanks Andy for helping out.

We need to look into how the error message could have been clearer. If a jar is missing, the actual error should be about a missing class. That would have saved a lot of time.
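One way to surface a missing jar early is to probe for the required classes before running the job and report any that cannot be loaded. The sketch below does this with the JDK's `Class.forName`. It is an illustration only and not part of elephant-bird: `ClasspathCheck` and `missingClasses` are hypothetical names, and the class names in `main` are taken from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: reports which of the given classes are not on the classpath.
final class ClasspathCheck {

    // Returns the subset of class names that cannot be loaded, in input order.
    static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList(
                "org.apache.hadoop.io.Text",
                "org.apache.mahout.math.VectorWritable",
                "com.twitter.elephantbird.pig.util.GenericWritableConverter");
        List<String> missing = missingClasses(required);
        if (!missing.isEmpty()) {
            System.err.println("Missing classes (is a jar not registered?): " + missing);
        }
    }
}
```

Run with the same classpath as the job, a check like this names the absent class directly instead of leaving it to show up later as an unrelated-looking type error.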