Open mjsduncan opened 14 hours ago
Thanks for isolating the problem! The escaped quote character before the comma does look suspicious. I'll try to load the CSV file in dask and Neo4j to reproduce the issue. Could you please let me know which version you are using, via `neo4j --version` or `cypher-shell --version`? The full import script would also help if you want to share it.
Check 1: Upstream of the CSV export, the entry in the SQLite database for `Xenbase:XB-GENE-29084259` seems to be fine. The JSON string in the properties column is parsed correctly by DB Browser for SQLite. The escaped quote character is required to keep the value string intact.
Check 2: Python's csv module, which was also used to export the data in the first place, can read `kg_nodes.csv` without an issue, and the suspicious entry also seems to be decoded correctly.
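For anyone who wants to repeat Check 2, here is a minimal sketch of how such a scan could look. It assumes the file has a header row and that the JSON lives in a column called `properties`; both are guesses on my part, and `find_bad_json_rows` is just an illustrative helper name, not something from kgw.

```python
import csv
import json


def find_bad_json_rows(csv_path, json_column="properties"):
    """Return (line_number, error_message) for rows whose JSON column fails to parse.

    Assumptions: the CSV has a header row and the JSON text is stored in
    `json_column` (hypothetical name, adjust to the real header).
    """
    bad_rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Default "excel" dialect: quotes are doubled, no escape character
        reader = csv.DictReader(f)
        for row in reader:
            try:
                json.loads(row[json_column])
            except (json.JSONDecodeError, TypeError) as exc:
                # reader.line_num is the physical line on which the record ended
                bad_rows.append((reader.line_num, str(exc)))
    return bad_rows
```

An empty result from `find_bad_json_rows("kg_nodes.csv")` would be consistent with the observation that Python itself reads the file without trouble.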
Check 3: LibreOffice Calc is able to load the relevant portion of the CSV file without a problem. The subset was generated with `head kg_nodes.csv -n 70000 | tail -n 10000 > kg_nodes_subset.csv`. This was necessary because the full CSV file contains too many lines for the program.
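If `head` and `tail` are not available, the same slice can presumably be produced in Python. This is only a sketch; `copy_line_range` is a made-up helper name. Like the shell pipeline, it cuts on physical lines, so a record with a line break inside a quoted field could be split at the boundary.

```python
from itertools import islice


def copy_line_range(src_path, dst_path, start, stop):
    """Copy physical lines start+1 .. stop (1-based) from src_path to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        # islice uses 0-based, half-open indices, like list slicing
        dst.writelines(islice(src, start, stop))


# Rough equivalent of the shell pipeline above (file names taken from this thread):
# copy_line_range("kg_nodes.csv", "kg_nodes_subset.csv", 60000, 70000)
```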
Preliminary conclusion: It seems to be a problem with Neo4j's CSV loader. The CSV export in kgw uses Python's built-in csv module with `writer = csv.writer(f, dialect="excel", quoting=csv.QUOTE_ALL)`, which was suggested somewhere on the web to have high compatibility. Perhaps a different parameterization would change how the quotes are written to the file, so that Neo4j can load it as well.
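To make the quoting concrete, here is a small self-contained sketch with the same writer settings. The record is hypothetical (ID and shortened name are invented); only the pattern of escaped quotes around a comma mirrors the entry in question.

```python
import csv
import io
import json

# Hypothetical record: only the escaped-quote-around-comma pattern
# mirrors the entry discussed in this thread.
props = json.dumps({"full_name": 'cGMP 3",5"-cyclic phosphodiesterase'})

buf = io.StringIO()
writer = csv.writer(buf, dialect="excel", quoting=csv.QUOTE_ALL)
writer.writerow(["example:1", props])
line = buf.getvalue()
print(line)  # shows the raw CSV text, including the sequence \"", from the report

# Reading it back with the same dialect restores the original JSON string.
row = next(csv.reader(io.StringIO(line)))
assert row[1] == props
print(json.loads(row[1])["full_name"])
```

The writer doubles every quote and leaves the JSON backslash alone, which yields `3\"",5\""-cyclic` in the file. My guess (not verified) is that a loader which treats the backslash as an escape character reads `\"` as a literal quote and then takes the next `"` as the end of the field.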
I was able to reproduce the error with a fresh installation of neo4j 4.4.38 with apoc from https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.4.0.32/apoc-4.4.0.32-core.jar.
Further observation: The slightly modified `writer = csv.writer(f, dialect="excel", quoting=csv.QUOTE_ALL, escapechar='\\')` results in a CSV file that Neo4j can load. Python's csv module can still read the raw data, but the property column differs, and calling `json.loads(nprop)` on it leads to a decoding error because a quoted value now contains `\\"` instead of `\"`, which terminates the string too early.
while loading monarchkg_v2024-09-12/results/kg_nodes.csv into a neo4j db (version 5.24.2) i received this error:
the offending quotes are in the line containing character position 26116004:
there is a comma in the word 3,5-cyclic in the full name string `""full_name"":""retinal cone rhodopsin-sensitive cGMP 3\"",5\""-cyclic phosphodiesterase subunit gamma""` that is being read as a table delimiter because the code is erroneously inserting an escape sequence to include the `,`. this appears to break the json syntax but it might just be a neo4j thing, i haven't tried to import the csv file anywhere else.