robert-haas / kgw

Knowledge graph workflows with Python.
https://robert-haas.github.io/kgw-docs
Apache License 2.0

output csv format causes error importing into neo4j #1

Open mjsduncan opened 14 hours ago

mjsduncan commented 14 hours ago

While loading monarchkg_v2024-09-12/results/kg_nodes.csv into a Neo4j DB (version 5.24.2), I received this error:

// Import nodes from kg_nodes.csv in batches
CALL {
    LOAD CSV WITH HEADERS FROM 'file:///kg_nodes.csv' AS row
    WITH row.id AS id, row.type AS type, apoc.convert.fromJsonMap(row.properties) AS properties
    CREATE (n:Entity)
    SET n.id = id,
        n.type = type,
        n += properties
} IN TRANSACTIONS;
At /var/lib/neo4j/import/kg_nodes.csv @ position 26116004 -  there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: '"f'

The offending quotes are in the line containing character position 26116004:

"Xenbase:XB-GENE-29084259","biolink:Gene","{""name"":""LOC100498474"",""xref"":""UniProtKB:A0A8J0T6Y3|NCBIGene:100498474|RefSeq:KAE8578778|Xenbase:XB-GENE-29084259|Xenbase:XB-GENEPAGE-29084258"",""in_taxon"":""NCBITaxon:8364"",""in_taxon_label"":""Xenopus tropicalis"",""provided_by"":""alliance_gene_nodes"",""synonym"":""LOC100498474|retinal cone rhodopsin-sensitive cGMP 3\"",5\""-cyclic phosphodiesterase subunit gamma"",""full_name"":""retinal cone rhodopsin-sensitive cGMP 3\"",5\""-cyclic phosphodiesterase subunit gamma"",""symbol"":""LOC100498474"",""type"":""SO:0000704""}"

There is a comma in the word 3,5-cyclic in the full_name string ""full_name"":""retinal cone rhodopsin-sensitive cGMP 3\"",5\""-cyclic phosphodiesterase subunit gamma"" that is being read as a field delimiter, because the code erroneously inserts an escape sequence to include the comma.

This appears to break the JSON syntax, but it might just be a Neo4j thing; I haven't tried to import the CSV file anywhere else.

robert-haas commented 11 hours ago

Thanks for isolating the problem! The escaped quote character before the comma indeed looks suspicious. I'll try to load the CSV file with dask and Neo4j to reproduce the issue. Could you please let me know which version you are using, e.g. via neo4j --version or cypher-shell --version? The full import script would also help, if you want to share it.

robert-haas commented 7 hours ago

[Screenshot: DB Browser for SQLite showing the entry for Xenbase:XB-GENE-29084259]

Check 1: Upstream from the CSV export, the entry in the SQLite database for Xenbase:XB-GENE-29084259 seems to be fine. The JSON string in the properties column is parsed correctly by DB Browser for SQLite. The escaped quote character is required to keep the value string intact.
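
For reference, a minimal sketch of this kind of check; the database path, table name, and column names are assumptions, not necessarily kgw's actual schema:

    # Hedged sketch: the db file, table and column names are assumed, not kgw's actual schema.
    import json
    import sqlite3

    con = sqlite3.connect("monarchkg.sqlite")  # assumed path to the intermediate SQLite DB
    cur = con.execute(
        "SELECT properties FROM nodes WHERE id = ?",
        ("Xenbase:XB-GENE-29084259",),
    )
    properties = json.loads(cur.fetchone()[0])  # the \" escapes are valid JSON and parse cleanly
    print(properties["full_name"])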

robert-haas commented 7 hours ago

[Screenshot: Python csv module reading the suspicious entry from kg_nodes.csv]

Check 2: Python's csv module, which was also used to export the data in the first place, can read kg_nodes.csv without issue, and the suspicious entry also seems to be decoded correctly.
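
A sketch of this read-back check; the column names id, type and properties come from the import query above, the file path is an assumption:

    # Sketch of the read-back check; column names are taken from the import query above.
    import csv
    import json

    with open("kg_nodes.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["id"] == "Xenbase:XB-GENE-29084259":
                properties = json.loads(row["properties"])  # decodes without error
                print(properties["full_name"])
                break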

robert-haas commented 7 hours ago

[Screenshot: LibreOffice Calc displaying kg_nodes_subset.csv]

Check 3: LibreOffice Calc is able to load the relevant portion of the CSV file without a problem. The subset was generated with head kg_nodes.csv -n 70000 | tail -n 10000 > kg_nodes_subset.csv. This was necessary because the full CSV file contains too many lines for the program.

robert-haas commented 7 hours ago

Preliminary conclusion: It seems to be a problem with Neo4j's CSV loader. The CSV export in kgw uses Python's built-in csv module with writer = csv.writer(f, dialect="excel", quoting=csv.QUOTE_ALL), which was suggested somewhere on the web as having high compatibility. Perhaps another parameterization would change how quotes are written to the file, ideally in a way that Neo4j can also load.
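
To illustrate what these settings produce, a sketch with a shortened value that mimics the problematic row (not kgw's actual export code):

    # Illustration only: shortened value mimicking the problematic row, in-memory buffer instead of a file.
    import csv
    import io
    import json

    properties = json.dumps(
        {"full_name": 'retinal cone rhodopsin-sensitive cGMP 3",5"-cyclic phosphodiesterase subunit gamma'}
    )  # the literal quote characters in the name are stored as \" in the JSON text

    buf = io.StringIO()
    writer = csv.writer(buf, dialect="excel", quoting=csv.QUOTE_ALL)  # current kgw settings
    writer.writerow(["Xenbase:XB-GENE-29084259", "biolink:Gene", properties])
    print(buf.getvalue())
    # The field now contains 3\"",5\""-cyclic: the writer doubles every quote character, so \" becomes \"".
    # Python's csv reader round-trips this, but Neo4j's LOAD CSV appears to read \" as an escaped quote
    # and the following " as the end of the quoted field, hence the "characters after that ending quote" error.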

robert-haas commented 6 hours ago

[Screenshot: fresh Neo4j 4.4.38 installation reproducing the LOAD CSV error]

I was able to reproduce the error with a fresh installation of Neo4j 4.4.38 and APOC from https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.4.0.32/apoc-4.4.0.32-core.jar.

robert-haas commented 5 hours ago

Further observation: The slightly modified writer = csv.writer(f, dialect="excel", quoting=csv.QUOTE_ALL, escapechar='\\') results in a CSV file that Neo4j can load. When the file is read back with Python's csv module, the raw data loads as well, but the properties column is different: calling json.loads(nprop) on it leads to a decoding error, because the value now contains \\" instead of \" within a quoted string, which makes the string terminate too early.
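
A sketch of the round-trip that shows this mismatch, using the same shortened value as above and an in-memory buffer instead of a real file:

    # Sketch of the round-trip mismatch; shortened value, io.StringIO instead of a real file.
    import csv
    import io
    import json

    properties = json.dumps(
        {"full_name": 'retinal cone rhodopsin-sensitive cGMP 3",5"-cyclic phosphodiesterase subunit gamma'}
    )

    buf = io.StringIO()
    writer = csv.writer(buf, dialect="excel", quoting=csv.QUOTE_ALL, escapechar="\\")
    writer.writerow(["Xenbase:XB-GENE-29084259", "biolink:Gene", properties])
    # The file now contains \\"" where the original JSON had \": the backslash is escaped, the quote doubled.
    # Neo4j can load this (see above), but Python's default csv reader does not undo the backslash escaping:

    buf.seek(0)
    row = next(csv.reader(buf, dialect="excel"))
    print(row[2] == properties)  # False: the reader returns \\" where the original had \"
    json.loads(row[2])           # raises json.JSONDecodeError (the quoted value terminates too early)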