Open paulhoule opened 10 years ago
The command line I am using to test smushing is
haruhi run job -clusterId tinyAwsCluster -jarId bakemonoJar smushSubject -input s3n://subjective-eye/0.9/subjectiveEye3D -sameAs s3n://basekb-sandbox/dbpediaMap -output s3n://basekb-sandbox/tryAgain
and superficially it looked like this worked. (It failed before because I entered the wrong command.)
Actually, when I look at the output I see it did not really work. There were two symptoms: partition 0 was 74 MB while all the others were 4 MB (so partitioning went wrong), and the output files all looked like
com.ontology2.bakemono.joins.TaggedTextItem@47b1eac3 com.ontology2.bakemono.joins.TaggedTextItem@f0a4d937
com.ontology2.bakemono.joins.TaggedTextItem@8ef17379 com.ontology2.bakemono.joins.TaggedTextItem@9a2ae44b
com.ontology2.bakemono.joins.TaggedTextItem@bdc97798 com.ontology2.bakemono.joins.TaggedTextItem@16c82b5f
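The `ClassName@hex` pattern above is what Java's default `Object.toString()` produces, so it looks like the key and value are being written out as text through a class that does not override `toString()`. Here is a minimal sketch of that behavior. `PlainItem` and `PrintableItem` are hypothetical stand-ins, not the real `TaggedTextItem`, and the tab-separated format is only an assumption about what readable output might look like.

```java
class ToStringDemo {
    // Hypothetical stand-in for a tagged key/value type with no toString() override.
    static class PlainItem {
        final String text;
        final int tag;

        PlainItem(String text, int tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    // Same fields, but with a toString() that a text-based writer can use.
    static class PrintableItem extends PlainItem {
        PrintableItem(String text, int tag) {
            super(text, tag);
        }

        @Override
        public String toString() {
            // Assumed format: payload, a tab, then the tag.
            return text + "\t" + tag;
        }
    }

    public static void main(String[] args) {
        // Prints the class name, '@', and a hex hash code, like the output above.
        System.out.println(new PlainItem("<http://example.org/a>", 1));
        // Prints the payload and the tag separated by a tab.
        System.out.println(new PrintableItem("<http://example.org/a>", 1));
    }
}
```

Seeing these objects in the final output at all is a hint that intermediate (map-side) records were written directly, without passing through a reducer.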
Now that I think of it, I think we wound up with the reducer count being set to zero, because these are part-m files (the names Hadoop gives to map output; reducer output would be part-r). Last time, when I got no output, I did set -R explicitly, so let's try:
haruhi run job -clusterId tinyAwsCluster -jarId bakemonoJar smushSubject -input s3n://subjective-eye/0.9/subjectiveEye3D -sameAs s3n://basekb-sandbox/dbpediaMap -output s3n://basekb-sandbox/tryAgain2 -R 1
Probably I should set a non-zero default for R, since any map-only job isn't going to have the R parameter.
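A minimal sketch of what such a default could look like, assuming the job receives its options as a plain string array. The `reducerCount` helper and the default of 1 are hypothetical and do not reflect how bakemono actually parses its arguments.

```java
class ReducerDefault {
    // Assumed default: any non-zero value avoids an accidental map-only run.
    static final int DEFAULT_REDUCERS = 1;

    // Returns the value following "-R" if present, otherwise the default.
    static int reducerCount(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-R".equals(args[i])) {
                return Integer.parseInt(args[i + 1]);
            }
        }
        return DEFAULT_REDUCERS;
    }

    public static void main(String[] args) {
        System.out.println(reducerCount(args));
    }
}
```

With this in place, leaving out -R would still run a reduce phase, and an explicit -R would override the default.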
Yep, with R=1 we get nothing out. I'm tempted to make a little sample file and add chatty logging to see what exactly goes through the reducer.
OK, to test this one I ran "grep Killzone" on both the map and eye files and uploaded the results to S3. This creates a tiny knowledge base, mainly about the Killzone game series. I see multiple problems though...
Early on I see
http://www.w3.org/2001/XMLSchema#float> .] with tag [16]
2014-03-06 23:37:44,354 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got key value [<http://dbpedia.org/resource/Killzone_3>] with tag [16]
2014-03-06 23:37:44,354 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got value value [<http://dbpedia.org/resource/Killzone_3> <http://rdf.basekb.com/public/subjectiveEye3D> "2.475553E-4"^^<http://www.w3.org/2001/XMLSchema#float> .] with tag [16]
but then later on I see the dbpedia map entry coming in another batch
2014-03-06 23:37:44,370 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got key value [part-m-00027.gz:<http://dbpedia.org/resource/Killzone_3>] with tag [2]
2014-03-06 23:37:44,370 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got value value [part-m-00027.gz:<http://dbpedia.org/resource/Killzone_3> <http://www.w3.org/2002/07/owl#sameAs> <http://rdf.basekb.com/ns/m.0bh78p7> .] with tag [2]
So there are two screw-ups here. One is that when I used grep on the test data it inserted the file names, so the map entry's key no longer matches the eye entry's key! The other is that we are getting 2 for the tag instead of 1.
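The first problem comes from grep labeling each match with its file name when it searches more than one file. GNU grep's -h (--no-filename) option suppresses that. To show why the label breaks the join, here is a small sketch using the two log lines above. `subjectKey` and `stripGrepPrefix` are hypothetical helpers, and the stripping rule assumes every triple starts with `<`.

```java
class GrepPrefixDemo {
    // Hypothetical helper: drop a leading "filename:" label added by grep.
    // Assumes the triple itself starts with '<'.
    static String stripGrepPrefix(String line) {
        int start = line.indexOf('<');
        return start > 0 ? line.substring(start) : line;
    }

    // The join key is taken to be the first space-delimited token (the subject).
    static String subjectKey(String line) {
        int space = line.indexOf(' ');
        return space < 0 ? line : line.substring(0, space);
    }

    public static void main(String[] args) {
        String eye = "<http://dbpedia.org/resource/Killzone_3> "
                + "<http://rdf.basekb.com/public/subjectiveEye3D> "
                + "\"2.475553E-4\"^^<http://www.w3.org/2001/XMLSchema#float> .";
        String map = "part-m-00027.gz:<http://dbpedia.org/resource/Killzone_3> "
                + "<http://www.w3.org/2002/07/owl#sameAs> "
                + "<http://rdf.basekb.com/ns/m.0bh78p7> .";

        // The keys differ, so the two records land in different reduce groups.
        System.out.println(subjectKey(eye).equals(subjectKey(map)));
        // With the label removed, the keys match again.
        System.out.println(subjectKey(eye).equals(subjectKey(stripGrepPrefix(map))));
    }
}
```

Regenerating the sample files with `grep -h` is probably simpler than cleaning them up afterwards.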
My last attempt at smushing the subjects for :SubjectiveEye3D failed, so I am now trying to do it right.