opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

neo4j plugin not working #70

Closed bhelou closed 5 years ago

bhelou commented 6 years ago

Hi,

For the export_neo4j plugin, I get the following exception:

Exception while data enrichment of has_images.pdf with plugin export_neo4j: Primary label and primary key are required for MERGE operation

It happens on all files. In case it doesn't for you, I've attached a file where the exception definitely occurred: test.txt.

To fix it, I use transactions. Specifically, I changed

graph.merge(document_node)

to

tx = graph.begin()
tx.merge(document_node, primary_label = "Document", primary_key = "name")

and

graph.merge(entity_node)

to

tx.merge(entity_node, primary_label = entity_class_label, primary_key = 'name')

and

graph.merge(relationship)

to

tx.merge(relationship, primary_label = relationship_label)

At the end of the process function, I've added

tx.commit()

neo4j now works. I don't know though if there's a better way to do the above (I just learned about neo4j :p).
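Put together, the changes above could look roughly like the sketch below. It is not the plugin's actual code: the function name export_document, the shape of the entities tuples, and the "MENTIONS" label are made up for illustration, and RecordingGraph / RecordingTransaction are hypothetical stand-ins so the example runs without py2neo or a Neo4j server. With a real py2neo 4 Graph, only graph.begin(), tx.merge(), tx.commit() and tx.rollback() are used. The rollback on error is an addition that is not part of the original fix.

```python
def export_document(graph, document_node, entities):
    """Merge a document and its entities in one transaction.

    `entities` is an iterable of hypothetical tuples:
    (entity_node, entity_class_label, relationship, relationship_label).
    """
    tx = graph.begin()
    try:
        # py2neo 4 requires a primary label and primary key for MERGE
        tx.merge(document_node, primary_label="Document", primary_key="name")
        for entity_node, entity_class_label, relationship, relationship_label in entities:
            tx.merge(entity_node, primary_label=entity_class_label, primary_key="name")
            tx.merge(relationship, primary_label=relationship_label)
        # Commit once, at the end of processing
        tx.commit()
    except Exception:
        # Assumption: undo partial writes if any merge fails
        tx.rollback()
        raise


class RecordingTransaction:
    """Stand-in for a py2neo transaction; it only records the calls made."""

    def __init__(self):
        self.calls = []

    def merge(self, subgraph, primary_label=None, primary_key=None):
        self.calls.append(("merge", subgraph, primary_label, primary_key))

    def commit(self):
        self.calls.append(("commit",))

    def rollback(self):
        self.calls.append(("rollback",))


class RecordingGraph:
    """Stand-in for py2neo.Graph that hands out one recording transaction."""

    def __init__(self):
        self.tx = RecordingTransaction()

    def begin(self):
        return self.tx


graph = RecordingGraph()
# Plain strings stand in for py2neo Node and Relationship objects
export_document(graph, "has_images.pdf", [("Alice", "Person", "doc->Alice", "MENTIONS")])
for call in graph.tx.calls:
    print(call)
```

Committing once at the end keeps all merges for a document in a single transaction, so a failure partway through does not leave a half-written document in the graph.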

Also, when I try to reset my OSS index (via opensemanticsearch-delete --empty), it doesn't empty the neo4j database. To remove it, I delete the database files via rm -rf /opt/neo4j/data/databases/*. I then restart neo4j via service neo4j restart. I'm not sure if this is a bug or a feature, but I thought I'd mention it.

Thanks for the OSS software! Bassam

ptmaroct commented 5 years ago

Hi, after making these changes I made some progress, but I got stuck again on the following error. The command I ran was:

$ etl-web http://anujsh.com -v

[Screenshot of the error: selection_157]

Mandalka commented 5 years ago

I upgraded to the newest Neo4j, and the new py2neo 4 seems to work with the py2neo lib instead of the underlying Neo4j Driver for Python, so I changed the pip3 install to use py2neo 4 again.

When upgrading, you may need to run pip3 uninstall py2neo first if it is still pinned to version 3.x.

The DB format may have changed, in which case the Neo4j database has to be deleted as described above.

Please reopen this issue (or I will, if further tests on a fresh Debian install cause problems) if problems remain after upgrading to Neo4j 3.5.2, py2neo 4.x, and the newest Open Semantic ETL.