hetio / hetionet

Hetionet: an integrative network of disease
https://neo4j.het.io

Speeding up data import to Neo4j v5 and CSV format data #57

Open nickzren opened 10 months ago

nickzren commented 10 months ago

I ran into problems loading Hetionet data into Neo4j 5.13 on my recently updated MacBook. The existing Neo4j dumps are no longer compatible with this version, and importing the data directly from JSON was too slow, taking an estimated 10+ hours.

To address this, I've written a script that converts the JSON data to CSV without losing any node, edge, or property value information. The JSON-to-CSV conversion takes approximately 30 seconds, and loading the CSV files into Neo4j takes around 40 seconds.

I've organized each node type and edge type into its own CSV file with an accompanying Cypher script. I believe this will make it easier for people to understand and work with the data.
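For readers who want a feel for the one-CSV-per-type layout, here is a minimal sketch of the node half of such a conversion. It is not the script from the csv branch. It assumes each node record is a dict with `kind`, `identifier`, `name`, and an optional `data` dict of extra properties; the helper names `group_nodes_by_kind` and `rows_to_csv` and the sample records are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def group_nodes_by_kind(hetnet):
    """Group node records by 'kind' and flatten each one into a CSV row.

    Assumes each node looks like
    {"kind": ..., "identifier": ..., "name": ..., "data": {...}}.
    """
    rows_by_kind = defaultdict(list)
    for node in hetnet["nodes"]:
        row = {"identifier": node["identifier"], "name": node["name"]}
        row.update(node.get("data", {}))  # extra properties become columns
        rows_by_kind[node["kind"]].append(row)
    return dict(rows_by_kind)


def rows_to_csv(rows):
    """Serialize rows to CSV text; the header is the union of all keys."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # Rows missing a column get an empty cell (DictWriter's default restval).
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-made placeholder records, not real Hetionet data.
example = {
    "nodes": [
        {"kind": "Gene", "identifier": 1, "name": "gene-a", "data": {"chromosome": "1"}},
        {"kind": "Disease", "identifier": "D:1", "name": "disease-a", "data": {}},
        {"kind": "Gene", "identifier": 2, "name": "gene-b", "data": {}},
    ]
}

csv_by_kind = {
    kind: rows_to_csv(rows)
    for kind, rows in group_nodes_by_kind(example).items()
}
```

Each value in `csv_by_kind` could then be written to a file named after its node type, for example `Gene.csv`. Edges would need the same treatment with source and target identifier columns added.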

If this sounds useful, I'd be open to integrating these changes into the main branch. Let me know your thoughts.

You can find the revised code at: https://github.com/nickzren/hetionet/tree/csv

dhimmel commented 10 months ago

Awesome work @nickzren. Nice job finding an efficient import method that works with the latest Neo4j stack.

I took a quick look at the changes and I'll need a little more time to think about where the code belongs... since it could possibly live in dhimmel/integrate or hetio/hetnetpy rather than in this repo whose focus is more the data and not the code to generate the data.

Taking a step back, there are a few contributions that would be of major utility (in order of importance/interest):

  1. a neo4j dump file that is compatible with neo4j 5 (and possibly future neo4j versions)
  2. code to generate the neo4j database and dump file (i.e. your csv branch)
  3. the csv files, but I'm a little cautious in that they have some similarities with the TSV files and we'd want to understand and document the differences.

Any thoughts?

nickzren commented 10 months ago

Thanks @dhimmel

I recognize the primary aim of this repo is data-centric rather than code-centric. However, I respectfully suggest that including data-specific scripts, such as Python and Cypher, could add value. No strong opinion here, and I'll defer to your ultimate decision on the matter.

The CSV data is derived from JSON files and serves as a comprehensive dataset, including all properties for both nodes and edges.

The CSV files and Cypher scripts are generated together so that the data types stay consistent between them, which means the import into Neo4j is not affected by Neo4j version changes.
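To illustrate why generating both files from the same metadata keeps types consistent, here is a rough sketch that builds a `LOAD CSV` statement from a column-to-type mapping. It is an assumption about the general approach, not the actual generator in the csv branch. The function name `load_csv_statement`, the `CYPHER_CASTS` table, and the file name are hypothetical; `toInteger`, `toFloat`, and `toBoolean` are standard Cypher functions.

```python
# Map Python types to the Cypher function that parses a CSV string.
# Strings need no cast because LOAD CSV yields string values.
CYPHER_CASTS = {int: "toInteger", float: "toFloat", bool: "toBoolean"}


def load_csv_statement(label, column_types, filename):
    """Build a LOAD CSV statement that casts each column to its known type.

    column_types maps CSV column name -> Python type seen in the JSON.
    Assumes column names are plain identifiers (no spaces or backticks needed).
    """
    props = []
    for column, py_type in column_types.items():
        cast = CYPHER_CASTS.get(py_type)
        value = f"{cast}(row.{column})" if cast else f"row.{column}"
        props.append(f"{column}: {value}")
    prop_map = ", ".join(props)
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS row\n"
        f"CREATE (:{label} {{{prop_map}}});"
    )


# Hypothetical column types for a Gene node file.
statement = load_csv_statement(
    "Gene", {"identifier": int, "name": str}, "Gene.csv"
)
print(statement)
```

Because the same `column_types` mapping would drive both the CSV header and the casts, a column cannot be written as one type and loaded as another.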

Having separate CSV files for nodes and edges not only makes the framework easier to understand but also lets users easily choose, modify, or extend the data.

I encountered difficulties with restoring from Neo4j dump files and couldn't resolve the issues, so I began exploring alternative solutions.

I'm still learning about this and graph databases, so any corrections are welcome.