Open drahcos opened 10 years ago
Can you post the actual Pig statement?
STORE newFiles INTO 'new-Vectors' USING SequenceFileStorage('-c com.twitter.elephantbird.pig.util.TextConverter', '-c com.twitter.elephantbird.pig.util.VectorWritableConverter');
While using the SequenceFileLoader (I didn't use it before because I can read the entries with seqdumper) I stored some entries with PigStorage to get a sample. I noticed that the loader doesn't get the values but only the keys. It definitely stores two things, since I can see a tab right after the key, but the values are empty. And yes, I noticed that I had the wrong path for VectorWritableConverter, but after changing it the problem remains.
Could you also give us the schema of your newFiles relation?
DESCRIBE newFiles;
DESCRIBE newFiles;
newFiles: {key: chararray,dbFiles::value: chararray} <- This is without the SequenceFileLoader.
Btw., I have now changed SequenceFileStorage to always choose VectorWritable (job.setOutputValueClass(VectorWritable.class);), which works, but I need to load it as VectorWritable and I didn't find a way to do this. Could you tell me where the SequenceFileLoader gets this information? I mean the position where I can directly enter "VectorWritable.class", like I did in SequenceFileStorage, so I can force it.
With the SequenceFileLoader: newFiles: {key: chararray,dbFiles::value: chararray}. dbFiles is what I load. newFiles is the result of a join with some keys.
Isn't there anything I can do? I really just need to load a tfidf-sequencefile, compare the keys and store some of the entries into a new tfidf-sequencefile. I don't need to manipulate the vectors or anything. Please tell me if there is a way to hardcode it or something else I can do. I'm in real need of this data.
If you don't need to do anything in pig with the vector data, please try out GenericWritableConverter:
I'm sorry but that also didn't work. It seems like SequenceFileLoader doesn't accept my input since I only get the standard Text class. Is this input correct?
dbFiles = LOAD 'ready-Vectors/tfidf-vectors' USING com.twitter.elephantbird.pig.load.SequenceFileLoader('-c com.twitter.elephantbird.pig.util.TextConverter', '-c com.twitter.elephantbird.pig.util.GenericWritableConverter') AS (key: chararray, value);
You may be missing REGISTER statements. All supporting jars must be included in the job; otherwise you'll run into class-not-found errors at runtime.
I registered them all:
REGISTER $ELEPH_LIBS/elephant-bird-core-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-cascading2-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-lucene-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-hive-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-hadoop-compat-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-rcfile-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-mahout-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-examples-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-pig-lucene-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-crunch-4.4.jar
even the ones I don't need. Do you know where in the code SequenceFileLoader sets the classes to load? I already hardcoded VectorWritable for the store function and I know it worked because when I used the hardcoded version I get:
Backend error message
java.io.IOException: java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:470)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:405)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Su
Pig Stack Trace
ERROR 2997: Encountered IOException. java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable
java.io.IOException: java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:470)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:405)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
When I hardcode it to the Text class I get a perfect sequence file but of course with Text for the value.
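For readers following along: the message above has the shape of a strict class check on append. As far as I know, Hadoop's SequenceFile writer compares the runtime class of each value with the value class declared for the output and rejects any mismatch. That would explain why forcing VectorWritable on the job fails while the converter still hands over Text. Below is a minimal stand-alone sketch of that kind of check in plain Java. `StrictWriter`, `FakeText` and `FakeVector` are hypothetical stand-ins for illustration, not Hadoop, Mahout or elephant-bird classes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class StrictWriterDemo {
    // Hypothetical stand-ins for Text and VectorWritable.
    static class FakeText {}
    static class FakeVector {}

    // Illustrative writer that only accepts values of the declared class.
    static class StrictWriter {
        private final Class<?> valueClass;
        private final List<Object> written = new ArrayList<>();

        StrictWriter(Class<?> valueClass) {
            this.valueClass = valueClass;
        }

        void append(Object value) throws IOException {
            // Exact class comparison: subclasses are rejected as well.
            if (value.getClass() != valueClass) {
                throw new IOException("wrong value class: "
                        + value.getClass().getName() + " is not " + valueClass);
            }
            written.add(value);
        }

        int size() {
            return written.size();
        }
    }

    public static void main(String[] args) {
        // Declared value class and actual value class disagree, as in the issue.
        StrictWriter writer = new StrictWriter(FakeVector.class);
        try {
            writer.append(new FakeText());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that changing only the declared output class is not enough. The values reaching the writer must already be instances of that class, which is the converter's job.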
I think I'm nearing the limit on my ability to help you with this, but let me go over my assumptions here once more:
You have an existing sequence file dataset, whose keys are Text and values VectorWritable.
You'd like to load this data into pig, filter the (key, value) pairs based on keys, then write the remaining (key, value) pairs back out into another sequence file.
You won't touch the values at all, but need to write them through to output.
If these assumptions are correct, you should be able to do this with something resembling the following:
REGISTER '$ELEPHLIBS/elephant-bird-core-.jar';
REGISTER '$ELEPHLIBS/elephant-bird-pig-.jar';
REGISTER 'path/to/mahout-math.jar';

%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare GENERIC_CONVERTER 'com.twitter.elephantbird.pig.util.GenericWritableConverter';
%declare VECTOR_WRITABLE 'org.apache.mahout.math.VectorWritable';

-- load existing data, resulting schema is (key: chararray, value: bytearray)
entry = LOAD 'seqfile_data' USING $SEQFILE_LOADER(
  '-c $TEXT_CONVERTER',
  '-c $GENERIC_CONVERTER'
);

-- filter entries
entry_filtered = FILTER entry BY key == 'something';

-- store remaining entries into new sequence file
STORE entry_filtered INTO 'seqfile_data_filtered' USING $SEQFILE_STORAGE(
  '-c $TEXT_CONVERTER',
  '-c $GENERIC_CONVERTER -t $VECTOR_WRITABLE'
);
Oh my god! You don't know how happy I am right now xD. Everything works perfectly! I need this data for my thesis and I spent so much time on things that were actually not part of my work because so much stuff went wrong. Honestly! Thank you!
Oh! Btw., it needs the mahout-core.jar. Thank you again! :D
Glad that things finally worked. Good luck with your thesis. Thanks, Andy, for helping out.
We need to look into how the error message could have been clearer. If a jar is missing, the actual error should be about a missing class. That would have saved a lot of time.
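One possible way to get a clearer error is to check up front that the classes a job depends on can actually be loaded, and to report the missing ones by name. The sketch below is only an illustration in plain Java, not elephant-bird code. `ClasspathCheck` and `findMissing` are hypothetical names, and the class list in `main` simply reuses the two classes from the stack trace in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClasspathCheck {
    /** Returns the names from classNames that cannot be loaded. */
    static List<String> findMissing(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // Look the class up without running its static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList(
                "org.apache.hadoop.io.Text",
                "org.apache.mahout.math.VectorWritable");
        List<String> missing = findMissing(required);
        if (missing.isEmpty()) {
            System.out.println("All required classes found.");
        } else {
            System.out.println("Missing classes (is a jar not registered?): " + missing);
        }
    }
}
```

Run without the Hadoop and Mahout jars on the classpath, `main` reports both classes as missing, which is the kind of message that would have pointed directly at the unregistered jar.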
Hi, I am trying to extract entries from a tfidf-SequenceFile which I created with seq2sparse. I can read and extract the content, but I need to create a new SequenceFile with the entries I extracted. The value needs to be of type VectorWritable (as in the seq2sparse tfidf output). I tried your SequenceFileStorage with '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter' as the second parameter, but the output always uses the Text class instead. Is there some way to handle this?
Regards, Richard