jexp / batch-import

generic csv file neo4j batch importer
https://neo4j.com/docs/operations-manual/current/tools/import/

Import Error #114

Closed jtgreen closed 9 years ago

jtgreen commented 9 years ago

I've seen chatter about this across the web, but no recommendation seems to work, including batch_import.csv.quotes=true. The stack is as follows with quotes=true:

Total import time: 28236 seconds
Exception in thread "main" java.lang.NumberFormatException: For input string: "C0003787"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:441)
    at java.lang.Long.parseLong(Long.java:483)
    at org.neo4j.batchimport.Importer.id(Importer.java:213)
    at org.neo4j.batchimport.Importer.id(Importer.java:181)
    at org.neo4j.batchimport.Importer.importRelationships(Importer.java:147)
    at org.neo4j.batchimport.Importer.doImport(Importer.java:232)
    at org.neo4j.batchimport.Importer.main(Importer.java:83)

john@john-MS-7693:~/Dev/Sandbox/UMLSVis/batch-import$ cat cuistr_agg.csv | grep C0003787
C0003787 ['Arizona - US state', 'AZ', 'Arizona - US state (geographic location)', 'Arizona'] ['T083']

My nodes file looks like:

cui:string:concepts syns stys
C0000005 ['(131)I-MAA', '(131)I-Macroaggregated Albumin'] ['T116', 'T130', 'T121']
C0000039 ['Dipalmitoylglycerophosphocholine', '1,2-Dihexadecyl-sn-Glycerophosphocholine', '1,2-Dipalmitoylphosphatidylcholine', 'Dipalmitoylphosphatidylcholine (substance)', 'Phosphatidylcholine, Dipalmitoyl', '1,2 Dipalmitoylphosphatidylcholine', '1,2 Dipalmitoyl Glycerophosphocholine', 'DPPC', 'Dipalmitoyllecithin', '1,2 Dihexadecyl sn Glycerophosphocholine', '3,5,9-Trioxa-4-phosphapentacosan-1-aminium, 4-hydroxy-N,N,N-trimethyl-10-oxo-7-((1-oxohexadecyl)oxy)-, inner salt, 4-oxide', '1,2-Dipalmitoyl-Glycerophosphocholine', 'Dipalmitoylphosphatidylcholine', 'Dipalmitoyl Phosphatidylcholine', 'DIPALMITOYLPHOSPHATIDYLCHOLINE 0102', '1,2-Dipalmitoylphosphatidylcholine [Chemical/Ingredient]'] ['T121', 'T119']
C0000052 ['Branching Enzyme, 1,4-alpha-Glucan', 'Amylo (1-4 to 1-6)-transglucosidase', 'Enzyme, 1,4-alpha-Glucan Branching', '1,4 alpha Glucan Branching Enzyme', 'Branching enzyme', '1,4-alpha-Glucan branching enzyme', 'GLUCAN BRANCHING ENZYME', 'Branching Enzyme, Starch', 'Branching Enzyme', '1,4-alpha-D-Glucan:1,4-alpha-D-glucan 6-alpha-D-(1,4-alpha-D-glucano)-transferase', 'Enzyme, Branching', 'ALPHA GLUCAN BRANCHING ENZYME 01 04', 'Branching Glycosyltransferase', 'Amylo-(1,4,6)-transglycosylase', 'Glycosyltransferase, Branching', 'Amylo-(1,4->,6)-transglycosylase', 'Enzyme, Starch Branching', '1,4-Alpha glucan branching enzyme', '1,4-alpha-Glucan branching enzyme (substance)', '1,4-alpha-Glucan Branching Enzyme', 'alpha-Glucan-branching glycosyltransferase', 'Starch Branching Enzyme', '1,4-alpha-Glucan Branching Enzyme [Chemical/Ingredient]'] ['T116', 'T126']
C0000074 ['1 Alkyl 2 Acylphosphatidates', '1-Alkyl-2-Acylphosphatidates'] ['T119']
C0000084 ['1 Carboxyglutamic Acid', 'gamma Carboxyglutamic Acid', '3-Amino-1,1,3-propanetricarboxylic Acid', 'CARBOXYGLUTAMIC ACID 01', 'gamma-Carboxyglutamic Acid', '1,1,3-Propanetricarboxylic acid, 3-amino-', '1-Carboxyglutamic Acid [Chemical/Ingredient]', '1-Carboxyglutamic Acid'] ['T116', 'T123']
C0000096 ['Isobutyltheophylline', 'ISOBUTYLMETHYLXANTHINE 03 01', 'MIBX', '1-Methyl-3-isobutylxanthine', '3 Isobutyl 1 methylxanthine', '1 Methyl 3 isobutylxanthine', 'METHYLISOBUTYLXANTHINE 01 03', '1H-Purine-2,6-dione, 3,7-dihydro-1-methyl-3-(2-methylpropyl)-', 'IBMX', '1-Methyl-3-isobutylxanthine [Chemical/Ingredient]', '3-Isobutyl-1-methylxanthine'] ['T109', 'T121']
C0000097 ['Methylphenyltetrahydropyridine (substance)', '1-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine', 'N-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine', 'MPTP', '1-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine [Chemical/Ingredient]', 'mptp', '1-Methyl-4-Phenyl-1,2,3,6-Tetrahydropyridine (MPTP)', 'methylphenyltetrahydropyridine', 'METHYLPHENYLTETRAHYDROPYRIDINE 01 04 01 02 03 06', 'Pyridine, 1,2,3,6-tetrahydro-1-methyl-4-phenyl-', 'Methylphenyltetrahydropyridine'] ['T109', 'T131']
C0000098 ['1 Methyl 4 phenylpyridine', 'N METHYL 4 PHENYLPYRIDINIUM', 'N-Methyl-4-phenylpyridine', 'Cyperquat', 'Pyridinium, 1-methyl-4-phenyl-', '1 Methyl 4 phenylpyridinium', 'N Methyl 4 phenylpyridine', 'MPP+', '1 Methyl 4 phenylpyridinium Ion', '1-Methyl-4-phenylpyridinium Ion', 'N-Methyl-4-phenylpyridinium', '1-Methyl-4-phenylpyridine', '1-Methyl-4-phenylpyridinium [Chemical/Ingredient]', 'METHYLPHENYLPYRIDINIUM 01 04', '1-Methyl-4-phenylpyridinium'] ['T109', 'T131']

My relations:

cui:string:concepts cui:string:concepts rela
C0236642 C0270715 \N
C0003787 C0037728 \N
C0018090 C0032636 \N
C0039194 C0024264 \N
C0004561 C0024264 \N
C0022801 C0035287 \N
C0022801 C0227525 \N
C0022801 C0449475 \N
C0034143 C0682702 \N

I am executing the following command after changing the Chunker BUFSIZE to 128*1024:

java -server -Dfile.encoding=UTF-8 -Xmx4G -jar target/batch-import-jar-with-dependencies.jar umls.db cuistr_agg.csv cuirel.csv

And my config is this (with or without batch_import.csv.quotes=true):

dump_configuration=true
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=25M
neostore.propertystore.db.index.mapped_memory=25M
neostore.nodestore.db.mapped_memory=400M
neostore.relationshipstore.db.mapped_memory=5G
neostore.propertystore.db.mapped_memory=400M
neostore.propertystore.db.strings.mapped_memory=400M
batch_array_separator=,

batch_import.csv.quotes=true

batch_import.csv.delim=,

batch_import.node_index.concepts=exact
batch_import.csv.quotes=true
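
The config above sets batch_import.csv.delim=, while the list-valued columns in the sample rows contain commas themselves. The following is a minimal Python sketch of how the choice of delimiter changes the number of fields a naive split produces for the C0003787 row. It assumes the columns in cuistr_agg.csv are separated by tabs, which the pasted output does not confirm. The helper split_fields is hypothetical and does not reproduce batch-import's own parser or its quote handling.

```python
# Sample row from the nodes file. The tab separators are an ASSUMPTION;
# the pasted output only shows whitespace between the columns.
ROW = (
    "C0003787\t"
    "['Arizona - US state', 'AZ', "
    "'Arizona - US state (geographic location)', 'Arizona']\t"
    "['T083']"
)


def split_fields(line, delim):
    """Split one line on a delimiter, with no quote handling (hypothetical helper)."""
    return line.rstrip("\n").split(delim)


if __name__ == "__main__":
    for name, delim in (("comma", ","), ("tab", "\t")):
        fields = split_fields(ROW, delim)
        print(name, len(fields), repr(fields[0]))
```

Under that assumption, only the tab split yields the three columns named in the header, with the bare CUI as the first field.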

jtgreen commented 9 years ago

Update: I set quotes=false again and set csv.delim=\t, and now I get:

Exception in thread "main" java.lang.NullPointerException
    at org.neo4j.batchimport.Importer.lookup(Importer.java:130)
    at org.neo4j.batchimport.Importer.id(Importer.java:183)
    at org.neo4j.batchimport.Importer.importRelationships(Importer.java:147)
    at org.neo4j.batchimport.Importer.doImport(Importer.java:232)
    at org.neo4j.batchimport.Importer.main(Importer.java:83)
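
Since the NullPointerException is raised in Importer.lookup while importing relationships, one thing that may be worth ruling out is a relationship row that refers to a CUI missing from the nodes file. Whether that is the actual cause here is not confirmed. The following is a minimal Python sketch of such a check, assuming both files are tab-separated with a single header line. The function names node_ids and missing_endpoints are hypothetical and not part of batch-import.

```python
import sys


def node_ids(lines, delim="\t"):
    """Collect the first-column values of the nodes file, skipping the header."""
    rows = iter(lines)
    next(rows, None)  # skip the header line
    return {line.rstrip("\n").split(delim)[0] for line in rows if line.strip()}


def missing_endpoints(rel_lines, known_ids, delim="\t"):
    """Return (line number, id) pairs for start/end ids absent from known_ids."""
    rows = iter(rel_lines)
    next(rows, None)  # skip the header line
    missing = []
    for lineno, line in enumerate(rows, start=2):
        if not line.strip():
            continue
        # The first two columns are the start and end node ids.
        for value in line.rstrip("\n").split(delim)[:2]:
            if value not in known_ids:
                missing.append((lineno, value))
    return missing


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_ids.py cuistr_agg.csv cuirel.csv
    with open(sys.argv[1], encoding="utf-8") as nodes:
        known = node_ids(nodes)
    with open(sys.argv[2], encoding="utf-8") as rels:
        for lineno, value in missing_endpoints(rels, known):
            print(f"line {lineno}: {value} not found in nodes file")
```

An empty result would suggest the ids line up and the problem lies elsewhere, for example in how the index named in the header is configured.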