paulhoule / infovore

RDF-Centric Map/Reduce Framework and Freebase data conversion tool

Integration test subject smush #111

Open paulhoule opened 10 years ago

paulhoule commented 10 years ago

My last attempt at smushing the subjects for :SubjectiveEye3D failed, so I am now trying to do it right.

paulhoule commented 10 years ago

The command line I am using to test smushing is

haruhi run job -clusterId tinyAwsCluster -jarId bakemonoJar smushSubject -input s3n://subjective-eye/0.9/subjectiveEye3D -sameAs s3n://basekb-sandbox/dbpediaMap -output s3n://basekb-sandbox/tryAgain

and superficially it looked like this worked, meaning the earlier failure happened because I entered the wrong command last time.

paulhoule commented 10 years ago

Actually, when I look at the output files I see it did not really work. There were two symptoms: partition 0 was 74 MB in size while all the others were 4 MB (so partitioning went wrong), and the output files contained Java's default object strings rather than triples:

com.ontology2.bakemono.joins.TaggedTextItem@47b1eac3    com.ontology2.bakemono.joins.TaggedTextItem@f0a4d937
com.ontology2.bakemono.joins.TaggedTextItem@8ef17379    com.ontology2.bakemono.joins.TaggedTextItem@9a2ae44b
com.ontology2.bakemono.joins.TaggedTextItem@bdc97798    com.ontology2.bakemono.joins.TaggedTextItem@16c82b5f

Now that I think of it, we probably wound up with the reducer count set to zero, because these are part-m files (that is, written directly by the mappers). Last time, when I got no output, I did set -R explicitly, so let's try:

haruhi run job -clusterId tinyAwsCluster -jarId bakemonoJar smushSubject -input s3n://subjective-eye/0.9/subjectiveEye3D -sameAs s3n://basekb-sandbox/dbpediaMap -output s3n://basekb-sandbox/tryAgain2 -R 1

I should probably set a non-zero default for R, since any map-only job isn't going to have the R parameter.

paulhoule commented 10 years ago

Yep, with R=1 we get nothing out. I'm tempted to make a little sample file and add chatty logging to see exactly what goes through the reducer.

paulhoule commented 10 years ago

OK, to test this one I ran "grep Killzone" on both the map and eye files and uploaded the results to S3. This creates a tiny knowledge base, mainly about the Killzone game series. I see multiple problems though...

Early on I see:

http://www.w3.org/2001/XMLSchema#float> .] with tag [16]
2014-03-06 23:37:44,354 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got key value [<http://dbpedia.org/resource/Killzone_3>] with tag [16]
2014-03-06 23:37:44,354 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got value value [<http://dbpedia.org/resource/Killzone_3>  <http://rdf.basekb.com/public/subjectiveEye3D>  "2.475553E-4"^^<http://www.w3.org/2001/XMLSchema#float> .] with tag [16]

but later on I see the DBpedia map entry coming in as a separate batch:

2014-03-06 23:37:44,370 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got key value [part-m-00027.gz:<http://dbpedia.org/resource/Killzone_3>] with tag [2]
2014-03-06 23:37:44,370 INFO com.ontology2.bakemono.rewriteSubject.RewriteSubjectReducer (main): Got value value [part-m-00027.gz:<http://dbpedia.org/resource/Killzone_3>  <http://www.w3.org/2002/07/owl#sameAs>  <http://rdf.basekb.com/ns/m.0bh78p7>    .] with tag [2]

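So there are two screw-ups here. One is that grep, run over several files at once, prefixed every matching line with its file name (the "part-m-00027.gz:" in front of the key), so the map keys no longer match the subjects in the eye file. The other is that we are getting 2 for the tag instead of 1.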
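The file-name problem can be reproduced on a small scale. The following is a minimal sketch, not the exact command used here: the file names and triples are made up for illustration, and it assumes plain-text inputs and a grep that supports the -h option (GNU and BSD grep both do). The real map files were gzipped, so the actual invocation would have differed.

```shell
# Hypothetical stand-ins for two partition files; the contents are made up.
workdir=$(mktemp -d)
printf '%s\n' '<http://example.org/Killzone_A> <http://example.org/p> "1" .' > "$workdir/part-a"
printf '%s\n' '<http://example.org/Killzone_B> <http://example.org/p> "2" .' > "$workdir/part-b"

# With more than one input file, grep prefixes each match with "filename:".
with_names=$(grep Killzone "$workdir"/part-*)

# -h suppresses the file-name prefix, so each line is still a clean triple.
without_names=$(grep -h Killzone "$workdir"/part-*)

printf 'default:\n%s\n\nwith -h:\n%s\n' "$with_names" "$without_names"

rm -rf "$workdir"
```

With the prefix suppressed, the subject in each sample line stays byte-for-byte identical to the subject in the other data set, which is what a join on the subject key would need.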