ogrisel / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Error Building a corpus from Italian Wikipedia #7

Open raymanrt opened 13 years ago

raymanrt commented 13 years ago

Hi, the command I ran is:

pig-0.8.1/bin/pig -x local \
  -p PIGNLPROC_JAR=pignlproc/target/pignlproc-0.1.0-SNAPSHOT.jar \
  -p LANG=it \
  -p INPUT=/home/rayman/Scrivania/wiki_dump/itwiki-latest-pages-articles.xml \
  -p OUTPUT=workspace \
  pignlproc/examples/ner-corpus/01_extract_sentences_with_links.pig

With pig-0.8.1 the script seems to work well on a single chunk of the dump, so I decided to process the whole dump (I have only one machine, but there's no hurry). After a couple of hours of processing, I got the following error:

2011-08-31 11:45:25,856 [Thread-624] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Regione_di_Worodougou,Diocesi di Odienné,4) (3)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:904)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:541)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:239)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2011-08-31 11:45:26,970 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0003 has failed! Stop running all dependent jobs
2011-08-31 11:45:26,972 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2011-08-31 11:45:26,973 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2011-08-31 11:45:26,973 [main] INFO org.apache.pig.tools.pigstats.PigStats - Detected Local mode. Stats reported below may be incomplete
2011-08-31 11:45:26,975 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
0.20.2  0.8.1   rayman  2011-08-31 11:09:15 2011-08-31 11:45:26 ORDER_BY,FILTER

Some jobs have failed! Stop running all dependent jobs

Job Stats (time in seconds):
JobId   Alias   Feature Outputs
job_local_0001  noredirect,parsed,sentences,stored  MAP_ONLY
job_local_0002  ordered SAMPLER

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_local_0003  ordered ORDER_BY    Message: Job failed!    file:///home/rayman/ner-training-itwiki/workspace/it/sentences_with_links,

Input(s):
Successfully read records from: "/home/rayman/Scrivania/wiki_dump/itwiki-latest-pages-articles.xml"

Output(s):
Failed to produce result in "file:///home/rayman/ner-training-itwiki/workspace/it/sentences_with_links"

Job DAG:
job_local_0001 -> job_local_0002,
job_local_0002 -> job_local_0003,
job_local_0003

2011-08-31 11:45:26,975 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-08-31 11:45:26,977 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-08-31 11:45:26,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2011-08-31 11:45:26,980 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-08-31 11:45:26,984 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
Details at logfile: /home/rayman/ner-training-itwiki/pig_1314781753331.log

And the log file says:

Pig Stack Trace

ERROR 2244: Job failed, hadoop does not return any error message

org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:119)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
    at org.apache.pig.Main.run(Main.java:500)
    at org.apache.pig.Main.main(Main.java:107)

pig_1314781753331.log (END)

What do you think about it?

Riccardo

ogrisel commented 13 years ago

Hmm, there do not seem to be any pignlproc-related packages in the stack trace... Is this error random, or can it be reproduced systematically?
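One quick way to double-check that observation is to group the stack frames by package prefix. The snippet below is only a rough sketch: the helper names `frame_packages` and `mentions` are made up for illustration, and the sample trace is the one pasted in the report above.

```python
import re
from collections import Counter

# Matches one Java stack frame such as
# "at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)".
FRAME_RE = re.compile(r"\bat\s+([\w.$]+)\(([\w.]+):(\d+)\)")


def frame_packages(trace: str, depth: int = 3) -> Counter:
    """Count stack frames per package prefix (first `depth` dotted segments)."""
    counts = Counter()
    for match in FRAME_RE.finditer(trace):
        prefix = ".".join(match.group(1).split(".")[:depth])
        counts[prefix] += 1
    return counts


def mentions(trace: str, needle: str) -> bool:
    """Return True if any frame's qualified method name contains `needle`."""
    return any(needle in m.group(1) for m in FRAME_RE.finditer(trace))


# Stack trace copied from the first report, flattened onto one line as pasted.
TRACE = (
    "java.io.IOException: Illegal partition for Null: false index: 0 "
    "(http://it.wikipedia.org/wiki/Regione_di_Worodougou,Diocesi di Odienné,4) (3) "
    "at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:904) "
    "at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:541) "
    "at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) "
    "at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116) "
    "at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:239) "
    "at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232) "
    "at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) "
    "at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) "
    "at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) "
    "at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) "
    "at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)"
)

if __name__ == "__main__":
    print(frame_packages(TRACE))
    print("pignlproc frames present:", mentions(TRACE, "pignlproc"))
```

Every frame in this trace falls under either `org.apache.hadoop` or `org.apache.pig`, which is consistent with the failure being raised inside Pig/Hadoop code rather than in a pignlproc class.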

raymanrt commented 13 years ago

Executing the same script on a different machine gives me the following exception:

2011-08-31 14:07:11,305 [Thread-622] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2011-08-31 14:07:11,325 [Thread-622] INFO org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2011-08-31 14:07:11,325 [Thread-622] INFO org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2011-08-31 14:07:11,326 [Thread-622] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-08-31 14:07:11,326 [Thread-622] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2011-08-31 14:07:11,327 [Thread-622] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2011-08-31 14:07:11,736 [Thread-622] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Eccitone,Scintillatore,15) (1)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:904)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:541)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:239)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2011-08-31 14:07:14,917 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0003 has failed! Stop running all dependent jobs
2011-08-31 14:07:14,919 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2011-08-31 14:07:14,919 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2011-08-31 14:07:14,919 [main] INFO org.apache.pig.tools.pigstats.PigStats - Detected Local mode. Stats reported below may be incomplete
2011-08-31 14:07:14,922 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
0.20.2  0.8.1   brainaetic  2011-08-31 13:41:30 2011-08-31 14:07:14 ORDER_BY,FILTER

Some jobs have failed! Stop running all dependent jobs

Job Stats (time in seconds):
JobId   Alias   Feature Outputs
job_local_0001  noredirect,parsed,sentences,stored  MAP_ONLY
job_local_0002  ordered SAMPLER

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_local_0003  ordered ORDER_BY    Message: Job failed!    file:///home/brainaetic/rayman/ner-training-itwiki/workspace/it/sentences_with_links,

Input(s):
Successfully read records from: "file:///home/brainaetic/rayman/itwiki-latest-pages-articles.xml"

Output(s):
Failed to produce result in "file:///home/brainaetic/rayman/ner-training-itwiki/workspace/it/sentences_with_links"

Job DAG:
job_local_0001 -> job_local_0002,
job_local_0002 -> job_local_0003,
job_local_0003

2011-08-31 14:07:14,922 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-08-31 14:07:14,924 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-08-31 14:07:14,924 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2011-08-31 14:07:14,927 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2011-08-31 14:07:14,932 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
Details at logfile: /home/brainaetic/rayman/ner-training-itwiki/pig_1314790889277.log

Pig Stack Trace

ERROR 2244: Job failed, hadoop does not return any error message

org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:119)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
    at org.apache.pig.Main.run(Main.java:500)
    at org.apache.pig.Main.main(Main.java:107)

ogrisel commented 13 years ago

Unfortunately, I have no idea what's happening. The best way to proceed would be to isolate the few Wikipedia articles that trigger the failure (assuming they are always the same) in a unit test, so that we can use the debugger and trace the origin of the issue.

raymanrt commented 13 years ago

The third execution failed with:

java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Regno_di_Sardegna,Santa Margherita di Staffora,13) (3)

I'll try one more run, but so far a different page has failed each time...

raymanrt commented 13 years ago

And again:

java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Repubblica_Socialista_Federale_di_Jugoslavia,Luciano Sušanj,2) (3)
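To compare the failing records across runs without eyeballing the logs, the messages can be parsed mechanically. The following is a minimal sketch, not part of pignlproc: the names `FailedKey` and `parse_illegal_partition` are hypothetical, the field order (link target, title, sentence order) is inferred from the `ORDER ... BY linkTarget, title, sentenceOrder` clause of the example script, and the trailing number is assumed to be the partition index reported by Hadoop.

```python
import re
from typing import NamedTuple, Optional


class FailedKey(NamedTuple):
    link_target: str     # assumed: first sort field (URL of the linked page)
    title: str           # assumed: second sort field (title of the source page)
    sentence_order: int  # assumed: third sort field
    partition: int       # trailing number in the message


MSG_RE = re.compile(
    r"Illegal partition for Null: false index: 0 "
    r"\((?P<fields>.*?)\) \((?P<partition>-?\d+)\)"
)


def parse_illegal_partition(line: str) -> Optional[FailedKey]:
    """Extract the key tuple and partition number from one log line."""
    match = MSG_RE.search(line)
    if match is None:
        return None
    # Split from both ends; a comma inside the URL itself would be mis-split.
    rest, order = match.group("fields").rsplit(",", 1)
    link_target, title = rest.split(",", 1)
    return FailedKey(link_target, title, int(order), int(match.group("partition")))


# The four messages reported so far in this thread.
MESSAGES = [
    "java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Regione_di_Worodougou,Diocesi di Odienné,4) (3)",
    "java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Eccitone,Scintillatore,15) (1)",
    "java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Regno_di_Sardegna,Santa Margherita di Staffora,13) (3)",
    "java.io.IOException: Illegal partition for Null: false index: 0 (http://it.wikipedia.org/wiki/Repubblica_Socialista_Federale_di_Jugoslavia,Luciano Sušanj,2) (3)",
]

PARSED = [parse_illegal_partition(m) for m in MESSAGES]

if __name__ == "__main__":
    for key in PARSED:
        print(key)
```

Run against the four messages above, this yields four distinct (link target, title, sentence order) keys, which matches the observation that the failing page differs from run to run.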

renaud commented 12 years ago

same here:


2012-02-28 19:31:51,008 [Thread-1469] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.io.IOException: Illegal partition for Null: false index: 0 (http://fr.wikipedia.org/wiki/Casimiro_Nay,Projet:Football/Index/C,1) (1)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:904)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:541)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:239)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
0.20.2  0.8.1   richarde    2012-02-28 18:24:06 2012-02-28 19:31:54 ORDER_BY,FILTER

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_local_0003  ordered ORDER_BY    Message: Job failed!    file:///pignlproc/output/wiki_dump_parsed/fr/sentences_with_links,

renaud commented 12 years ago

turning off sorting for now

diff --git a/examples/ner-corpus/01_extract_sentences_with_links.pig b/examples/ner-corpus/01_extract_sentences_with_links.pig
index ead569e..2767b39 100644
--- a/examples/ner-corpus/01_extract_sentences_with_links.pig
+++ b/examples/ner-corpus/01_extract_sentences_with_links.pig
@@ -28,6 +28,8 @@ sentences = FOREACH projected
 stored = FOREACH sentences
   GENERATE title, sentenceOrder, linkTarget, linkBegin, linkEnd, sentence;

+STORE stored INTO '$OUTPUT/$LANG/sentences_with_links_unordered';
+
 -- Ensure ordering for fast merge with type info later
-ordered = ORDER stored BY linkTarget ASC, title ASC, sentenceOrder ASC;
-STORE ordered INTO '$OUTPUT/$LANG/sentences_with_links';
+-- ordered = ORDER stored BY linkTarget ASC, title ASC, sentenceOrder ASC;
+-- STORE ordered INTO '$OUTPUT/$LANG/sentences_with_links';

renaud commented 12 years ago

For the record, switching to hadoop-0.20.2 (I had previously tried hadoop-0.20.205.0 and hadoop-0.23.1) and to a single-node setup (instead of local mode) worked for me.

ogrisel commented 12 years ago

Hmm, so this might be a Pig / Hadoop versioning bug?

renaud commented 12 years ago

I would assume...
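For context, as far as I can tell the exception comes from a range check in Hadoop's map output collector: the partitioner returns an index, and the collector rejects it when it is negative or not smaller than the number of reduce partitions. The sketch below is a Python model of that check, not the Hadoop code itself; `check_partition` and `IllegalPartitionError` are hypothetical names, and the partition count of 1 in the usage lines is an assumption for illustration, since the logs above do not show the actual reducer count.

```python
class IllegalPartitionError(Exception):
    """Stand-in for the java.io.IOException seen in the logs."""


def check_partition(key: str, partition: int, num_partitions: int) -> int:
    """Return `partition` if it lies in [0, num_partitions), else raise.

    Models the assumed range check only; it does not reproduce how the
    partitioner chooses the index in the first place.
    """
    if partition < 0 or partition >= num_partitions:
        raise IllegalPartitionError(f"Illegal partition for {key} ({partition})")
    return partition


if __name__ == "__main__":
    key = "Null: false index: 0 (http://it.wikipedia.org/wiki/Eccitone,Scintillatore,15)"
    # Assumed scenario: the partitioner answers 1, but only one partition exists.
    try:
        check_partition(key, 1, 1)
    except IllegalPartitionError as err:
        print(err)
```

If that reading is right, the trailing `(3)` or `(1)` in each message is the rejected index, which would point at a mismatch between the partitioner used for the ORDER BY job and the number of reducers the job actually runs with.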

qwaider commented 8 years ago

Setting default_parallel to 2 in that Pig file should fix the bug for the local test.

Bests, Mohammed Qwaider