varun196 / knowledge_graph_from_unstructured_text

Building knowledge graph from input data
47 stars, 29 forks

IndexError: list index out of range while coreference #3

Open anindyasdas opened 4 years ago

anindyasdas commented 4 years ago
Traceback (most recent call last):
  File "knowledge_graph.py", line 292, in <module>
    main()
  File "knowledge_graph.py", line 287, in main
    doc = resolve_coreferences(doc,stanford_core_nlp_path,named_entities,verbose)
  File "knowledge_graph.py", line 217, in resolve_coreferences
    result = coref_obj.resolve_coreferences(corefs,doc,ner,verbose)
  File "knowledge_graph.py", line 200, in resolve_coreferences
    replaced_sent = words[i] + " "+ replaced_sent
IndexError: list index out of range

Data file added for reproducing the error: input_data (1).txt

Preliminary analysis suggests the following: the file has tokens like "North-East" and "third-largest". The Stanford tokenizer used for coreference splits on hyphens, while NLTK does not. So, according to NLTK, the corresponding sentence has 37 tokens, which does not match the coreference indices (computed over 41 tokens), e.g. ['North', '-', 'East', 'third', '-', 'largest'].
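
The mismatch described above can be reproduced without either library. The sketch below is only an illustration: the two regex tokenizers are hypothetical stand-ins (one keeps hyphenated words whole, one emits hyphens as separate tokens), not the actual NLTK or Stanford tokenizers, and the sentence is made up rather than taken from the attached data file.

```python
import re

def tokenize_keep_hyphens(text):
    # Stand-in for a tokenizer that keeps hyphenated words as one token.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

def tokenize_split_hyphens(text):
    # Stand-in for a tokenizer that emits each hyphen as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# Made-up example sentence containing two hyphenated words.
sentence = "It is the third-largest city in the North-East."

kept = tokenize_keep_hyphens(sentence)
split = tokenize_split_hyphens(sentence)
print(len(kept), kept)
print(len(split), split)

# An index computed against the hyphen-splitting tokenization...
east_index = split.index("East")

# ...can point past the end of the shorter token list.
try:
    kept[east_index]
except IndexError as err:
    print("IndexError:", err)
```

Each hyphenated word adds two tokens in the second tokenization, so two such words account for a difference of four, the same gap as between 37 and 41 in the report.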

Kojo7 commented 3 years ago

I'm also having the same issue. Can anyone help, please?

anindyasdas commented 3 years ago

The issue is mainly due to the use of two different tokenizers, which disagree on how to handle "-" and other special characters. Use the spaCy tokenizer instead of NLTK or whitespace splitting.
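
Whichever tokenizer is used, the word list and the coreference indices need to come from the same tokenization. Here is a minimal sketch of a replacement step that takes the tokenizer as a parameter and fails with a clear message when the indices do not fit. The names `replace_mention` and `split_hyphen_tokenize` are hypothetical and not part of knowledge_graph.py, the span is assumed to be 0-based and end-exclusive, and the regex tokenizer is only a stand-in for one that splits on hyphens.

```python
import re

def split_hyphen_tokenize(text):
    # Hypothetical stand-in for a tokenizer that splits on hyphens.
    # A spaCy pipeline could be wrapped the same way, e.g.
    # lambda s: [t.text for t in nlp(s)]
    return re.findall(r"\w+|[^\w\s]", text)

def replace_mention(sentence, start, end, replacement,
                    tokenize=split_hyphen_tokenize):
    """Replace tokens [start, end) of `sentence` with `replacement`.

    The span is assumed to refer to the tokenization that `tokenize`
    produces; if it does not fit, the tokenizers probably disagree.
    """
    words = tokenize(sentence)
    if not 0 <= start < end <= len(words):
        raise ValueError(
            f"span [{start}, {end}) does not fit {len(words)} tokens; "
            "the tokenizers probably disagree"
        )
    return " ".join(words[:start] + [replacement] + words[end:])

# Made-up example sentence.
sentence = "It is the third-largest city in the North-East."

# Indices and words come from the same tokenizer: replacement works.
print(replace_mention(sentence, 0, 1, "The city"))

# Indices from the hyphen-splitting tokenizer, words from whitespace
# splitting: the guard reports the mismatch instead of an IndexError.
try:
    replace_mention(sentence, 11, 12, "X", tokenize=str.split)
except ValueError as err:
    print("ValueError:", err)
```

The guard does not fix a mismatch; it only makes it visible at the point where the indices are first used.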

Kojo7 commented 3 years ago

Thanks very much, that helped