jexp / batch-import

generic csv file neo4j batch importer
https://neo4j.com/docs/operations-manual/current/tools/import/

Node NotFoundException On Every Node Inserted By Batch Importer #90

Closed — bgottfried91 closed this issue 10 years ago

bgottfried91 commented 10 years ago

Using the 2.0 branch of the batch importer, I'm able to import ~11 million nodes and ~94 million relationships, but apparently only 1 property is being imported:

[screenshot]

Currently my batch.properties allocates 10GB each to node and relationship storage and 12GB each to property and long-string storage; I'm working on a system with 48GB of RAM, so those are pretty much the upper limits of the system.

Attempting to query for any node results in this error:

[screenshot of the NotFoundException]

Any suggestions on what I need to change about my import process to fix the issue?

cescalante-carecloud commented 10 years ago

Can you post your equivalent of the sample/import.sh file?

bgottfried91 commented 10 years ago

I wasn't sure if you meant the code in the actual import.sh file used to run the batch import (I'm not using Maven to build it) or the args provided in the command, so here are both:

  1. I'm using the import.sh script that comes with the zip from the readme. The code inside of it is as follows:

    HEAP=4G
    DB=${1-target/graph.db}
    shift
    NODES=${1-nodes.csv}
    shift
    RELS=${1-rels.csv}
    shift
    CP=""
    for i in lib/*.jar; do CP="$CP":"$i"; done

    echo java -classpath $CP -Xmx$HEAP -Xms$HEAP -Dfile.encoding=UTF-8 org.neo4j.batchimport.Importer batch.properties "$DB" "$NODES" "$RELS" "$@"

    java -classpath $CP -Xmx$HEAP -Xms$HEAP -Dfile.encoding=UTF-8 org.neo4j.batchimport.Importer batch.properties "$DB" "$NODES" "$RELS" "$@"

  2. The command to run the import is:

    bash import.sh nonIndexed.db uids.csv,pmids.csv rels.csv,synonyms.csv

The CSV files range from several MB to multiple GB in size; I can find somewhere to put them up, but they're gigantic...

jexp commented 10 years ago

Could it be that you used a different field separator? The default is tab, though you can configure a comma instead.

Perhaps try it with a single, smaller file first to find the issue? And perhaps you can share a small sample file that makes the issue reproducible.
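If the files are comma-separated rather than tab-separated, the delimiter can be overridden in batch.properties. A minimal sketch, assuming the batch_import.csv.delim key described in this project's readme:

    # assumed key from the project readme; overrides the default tab delimiter
    batch_import.csv.delim=,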

jexp commented 10 years ago

Btw, can you share your batch.properties? You don't need that much memory for strings and properties; 1-2GB each should be fine. Try to give the relationship store the most memory.
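For example, a sketch of an allocation along those lines for a 48GB machine (the exact numbers are illustrative, not prescriptive):

    use_memory_mapped_buffers=true
    # give the relationship store the most memory
    neostore.relationshipstore.db.mapped_memory=20G
    neostore.nodestore.db.mapped_memory=4G
    # 1-2GB each is plenty for properties and long strings
    neostore.propertystore.db.mapped_memory=2G
    neostore.propertystore.db.strings.mapped_memory=2G
    neostore.propertystore.db.arrays.mapped_memory=0M
    neostore.propertystore.db.index.keys.mapped_memory=15M
    neostore.propertystore.db.index.mapped_memory=15M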

bgottfried91 commented 10 years ago

I'll include the first 10 lines of each of the CSVs, as well as the batch.properties. I'll cut out a small portion of each of the files and put up a link to them:

    head uids.csv
    uid	name
    D000001	Calcimycin
    D000001-1	A-23187
    D000001-2	A 23187
    D000001-3	Antibiotic A23187
    D000001-4	A23187, Antibiotic
    D000001-5	A23187
    D000002	Temefos
    D000002-1	Temephos
    D000002-2	Abate

    head pmids.csv
    pmid
    12255683
    12334433
    20255877
    12255369
    12255508
    12305503
    12233291
    12259097
    12334491

    head rels.csv
    pmid	uid	type
    218986	94827	Mentions
    218987	35807	Mentions
    218987	44082	Mentions
    218987	44093	Mentions
    218987	57667	Mentions
    218987	75228	Mentions
    218987	75242	Mentions
    218987	83565	Mentions
    218987	106937	Mentions

    head synonyms.csv
    uid	synonym	type
    0	0	MENTIONS
    0	1	MENTIONS
    0	1	MENTIONS
    0	1	MENTIONS
    0	1	MENTIONS
    0	1	MENTIONS
    6	6	MENTIONS
    6	7	MENTIONS
    6	7	MENTIONS

    cat batch.properties
    use_memory_mapped_buffers=true
    neostore.nodestore.db.mapped_memory=10G
    neostore.relationshipstore.db.mapped_memory=10G
    neostore.propertystore.db.mapped_memory=12G
    neostore.propertystore.db.strings.mapped_memory=12G
    neostore.propertystore.db.arrays.mapped_memory=0M
    neostore.propertystore.db.index.keys.mapped_memory=15M
    neostore.propertystore.db.index.mapped_memory=15M
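As a quick sanity check on the separators, a one-liner like this (a sketch, assuming tab-delimited files) prints the distribution of field counts per line; more than one distinct count means some rows are mangled:

    # count tab-separated fields on each line and tally how often each count occurs
    awk -F'\t' '{ print NF }' rels.csv | sort | uniq -c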

jexp commented 10 years ago

Can you please share them as a zip file? Otherwise the line endings and delimiters get mangled.

bgottfried91 commented 10 years ago

I gzipped each of the four input files and put them into this folder on Google Drive; let me know if they're broken in some way: https://drive.google.com/folderview?id=0Bx98DkxmHnEtWE5BRzlfM2lqYTQ&usp=sharing

A little context from my testing of this sample: when I constructed the database from these files and started the server with it, the properties were there and I could query for specific nodes. The only change from the original files is that the two relationship files were heavily truncated. Does this mean I need to allocate more memory for the relationships?

bgottfried91 commented 10 years ago

Final update: whatever the issue was, I can't reproduce it now using the full files. As such, I'll be closing the issue, though if anyone has an idea why it might have occurred originally, I'd love to hear about it.