majdigital / bigworldgraph

Revealing the connections that shape the world we live in.
http://bigworldgraph-stage.maj.digital/
MIT License

Error when running the pipeline #38

Open pbouda opened 5 years ago

pbouda commented 5 years ago

I got this error after running the pipeline as described in the docs:

FrenchServerDependencyParseTask processing article 'Affaire des marchés publics d'Île-de-France'...
FrenchServerDependencyParseTask finished sentence #194974/00001.ERROR: [pid 23] Worker Worker(salt=800150576, workers=1, host=b3a5ccfc6877, username=root, pid=23) failed    FrenchServerDependencyParseTask(task_config={"PIPELINE_DEBUG": true, "CORPUS_ENCODING": "utf-8", "STANFORD_CORENLP_SERVER_ADDRESS": "http://stanford:9000", "LANGUAGE_ABBREVIATION": "fr", "ONLY_INCLUDE_RELEVANT_SENTENCES": true, "ONLY_INCLUDE_RELEVANT_ARTICLES": true, "SENTENCE_TOKENIZER_PATH": "tokenizers/punkt/PY3/french.pickle", "CORPUS_INPATH": "/french_corpora/corpus_affairs_modern_french_in_france.xml", "WIKIPEDIA_ARTICLE_TAG_PATTERN": "<doc id=\"(\\d+)\" url=\"(.+?)\" title=\"(.+?)\">", "WIKIPEDIA_READING_OUTPUT_PATH": "/french_pipeline/fr_articles.json", "STANFORD_MODELS_PATH": "/stanford_models/french.jar", "STANFORD_NER_MODEL_PATH": "/stanford_models/ner-model-french.ser.gz", "NES_OUTPUT_PATH": "/french_pipeline/fr_articles_nes.json", "CORENLP_STANFORD_NER_MODEL_PATH": "/stanford_models/ner-model-french.ser.gz", "STANFORD_POSTAGGER_PATH": "/stanford_models/stanford-postagger.jar", "STANFORD_POS_MODEL_PATH": "/stanford_models/french.tagger", "POS_OUTPUT_PATH": "/french_pipeline/fr_articles_pos.json", "DEPENDENCY_TREE_KEEP_FIELDS": ["address", "ctag", "deps", "word", "head", "rel"], "STANFORD_CORENLP_MODELS_PATH": "/stanford_models/stanford-corenlp-3.7.0-models.jar", "STANFORD_DEPENDENCY_MODEL_PATH": "/stanford_models/UD_French.gz", "DEPENDENCY_OUTPUT_PATH": "/french_pipeline/fr_articles_dependencies.json", "VERB_NODE_POS_TAGS": ["VPP", "V", "VINF", "VPR", "VS"], "OMITTED_TOKENS_FOR_ALIGNMENT": [], "NER_TAGSET": ["I-PERS", "B-PERS", "I-LOC", "B-LOC", "I-ORG", "B-ORG", "I-MISC", "B-MISC"], "ORE_OUTPUT_PATH": "/french_pipeline/fr_articles_relations.json", "PARTICIPATION_PHRASES": {"I-PER": "particip\u00e9 \u00e0", "I-LOC": "est la sc\u00e8ne de", "I-ORG": "est impliqu\u00e9 dans", "I-MISC": "est li\u00e9 \u00e0", "DATE": "\u00e9tait au moment de", "DEFAULT": "particip\u00e9 \u00e0"}, 
"PE_OUTPUT_PATH": "/french_pipeline/fr_articles_participations.json", "DEFAULT_NE_TAG": "O", "RELATION_MERGING_OUTPUT_PATH": "/french_pipeline/fr_articles_merged_relations.json", "PC_OUTPUT_PATH": "/french_pipeline/fr_articles_properties.json", "RELEVANT_WIKIDATA_PROPERTIES": {"I-PER": ["P21", "P463", "P106", "P108", "P39", "P102", "P1416", "P18"], "I-LOC": ["P30", "P17", "P18"], "I-ORG": ["P1384", "P335", "P159", "P18"], "I-MISC": ["P18"]}, "WIKIDATA_PROPERTIES_IMPLYING_RELATIONS": {"P463": "Organization", "P108": "Company", "P102": "Party", "P1416": "Party", "P335": "Company"}, "PIPELINE_RUN_INFO_OUTPUT_PATH": "/french_pipeline/fr_info.json", "NEO4J_USER": "neo4j", "NEO4J_PASSWORD": "neo4jj", "NEO4J_HOST": "neo4j", "DATABASE_CATEGORIES": {"Entity": 0, "Organization": 2, "Company": 3, "Party": 3, "Miscellaneous": 1, "Affair": 6, "Politician": 3, "Person": 2, "Businessperson": 3, "Media": 4, "Location": 1, "Journalist": 5}})
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.6/site-packages/luigi/worker.py", line 139, in _run_get_new_deps
    task_gen = self.task.run()
  File "/data/src/app/bwg/decorators.py", line 89, in func_wrapper
    function_result = func(*args, **kwargs)
  File "/data/src/app/bwg/tasks/dependency_parsing.py", line 35, in run
    serializing_function=serialize_dependency_parse_tree, output_file=output_file,
  File "/data/src/app/bwg/decorators.py", line 89, in func_wrapper
    function_result = func(*args, **kwargs)
  File "/data/src/app/bwg/mixins.py", line 193, in process_articles
    for serializing_kwargs in self.task_workflow(article, **self.workflow_resources):
  File "/data/src/app/bwg/tasks/dependency_parsing.py", line 59, in task_workflow
    parsed_sentence = self._dependency_parse(sentence_data, **workflow_resources)
  File "/data/src/app/bwg/tasks/corenlp_server_tasks.py", line 78, in _dependency_parse
    sentence_data, action="depparse", postprocessing_func=self._postprocess_dependency_parsed,
  File "/data/src/app/bwg/mixins.py", line 56, in process_sentence_with_corenlp_server
    return postprocessing_func(result_json)
  File "/data/src/app/bwg/tasks/corenlp_server_tasks.py", line 91, in _postprocess_dependency_parsed
    if len(result_json["sentences"]) == 0:
TypeError: string indices must be integers
INFO: Informed scheduler that task   FrenchServerDependencyParseTask___PIPELINE_DEBUG_ef4d9c7c93   has status   FAILED
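For what it's worth, the `TypeError: string indices must be integers` at the bottom of the traceback suggests that the CoreNLP server returned a plain error string (e.g. on a timeout or overload) instead of a JSON object, so `result_json["sentences"]` indexes into a string. A minimal defensive guard, sketched as a standalone function (the real `_postprocess_dependency_parsed` lives in `bwg/tasks/corenlp_server_tasks.py`; this is just an illustration of the check, not the project's actual code):

```python
def postprocess_dependency_parsed(result_json):
    """Return the first sentence's parse, or None on a bad response.

    Guard: the CoreNLP server sometimes answers with an error *string*
    (e.g. "CoreNLP request timed out") rather than parsed JSON, which is
    what triggers "TypeError: string indices must be integers" above.
    """
    if not isinstance(result_json, dict) or "sentences" not in result_json:
        return None  # or log the raw response and retry the request
    if len(result_json["sentences"]) == 0:
        return None
    return result_json["sentences"][0]
```

With a guard like this the failing article would be skipped (or retried) instead of killing the whole worker.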
mickaelchanrion commented 5 years ago

Hi Peter,

Dennis and I have updated a few things since yesterday.

Note:

(Screenshot: 2019-06-12 at 11:33:55.) This section takes a while and there are no progress indicators, so just wait :)

It should work now; at least it did on my side, and I got the project running. By the way, the tests aren't working, so I just skipped that section.

Let me know if it works better now.

Kaleidophon commented 5 years ago

Unfortunately, there are still some occasional, more obscure errors in the pipeline that I don't have time to look into right now :-( One of them occurs with the data Mickael posted above, and I am trying to fix it now. If you use that data, the pipeline recognizes that some tasks have already been executed and skips them (except for the last task, which writes the data into the database).