vesoft-inc / nebula-importer

Nebula Graph Importer with Go
Apache License 2.0

What is NebulaGraph Importer?

NebulaGraph Importer is a tool to import data into NebulaGraph.

Features

See configuration instructions for more features.

How to Install

From Releases

Download a package from the Releases page and give it execute permission.

Choose the package that suits your needs. The following installation packages are supported:

From go install

$ go install github.com/vesoft-inc/nebula-importer/cmd/nebula-importer@latest

From docker

$ docker pull vesoft/nebula-importer:<version>
$ docker run --rm -ti \
      --network=host \
      -v <config_file>:<config_file> \
      -v <data_dir>:<data_dir> \
      vesoft/nebula-importer:<version> \
      --config <config_file>

# config_file: the absolute path to the configuration file.
# data_dir: the absolute path to the data directory, ignore if not a local file.
# version: the version of NebulaGraph Importer.
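
For scripted runs, the same invocation can be assembled as an argument list, which avoids shell line-continuation mistakes. This is a minimal Python sketch, assuming Docker is installed; the helper name `docker_run_args` is hypothetical, and only the flags shown in the command above are used.

```python
import os


def docker_run_args(config_file: str, data_dir: str, version: str) -> list[str]:
    """Build the `docker run` argument list shown above (hypothetical helper)."""
    # Both mounts use absolute paths so they resolve to the same location
    # inside the container as on the host.
    config_file = os.path.abspath(config_file)
    data_dir = os.path.abspath(data_dir)
    return [
        "docker", "run", "--rm", "-ti",
        "--network=host",
        "-v", f"{config_file}:{config_file}",
        "-v", f"{data_dir}:{data_dir}",
        f"vesoft/nebula-importer:{version}",
        "--config", config_file,
    ]
```

The returned list can be passed to `subprocess.run(args, check=True)`. Because of `-ti`, this expects to be run from an interactive terminal.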

From Source Code

$ git clone https://github.com/vesoft-inc/nebula-importer
$ cd nebula-importer
$ make build

You can find a binary named nebula-importer in the bin directory.

Configuration Instructions

NebulaGraph Importer's configuration file is in YAML format. You can find some examples in the examples directory.

Configuration options are divided into four groups:

client

client:
  version: v3
  address: "127.0.0.1:9669"
  user: root
  password: nebula
  ssl:
    enable: true
    certPath: "your/cert/file/path"
    keyPath: "your/key/file/path"
    caPath: "your/ca/file/path"
    insecureSkipVerify: false
  concurrencyPerAddress: 16
  reconnectInitialInterval: 1s
  retry: 3
  retryInitialInterval: 1s

manager

manager:
  spaceName: basic_int_examples
  batch: 128
  readerConcurrency: 50
  importerConcurrency: 512
  statsInterval: 10s
  hooks:
    before:
      - statements:
          - UPDATE CONFIGS storage:wal_ttl=3600;
          - UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
      - statements:
          - |
            DROP SPACE IF EXISTS basic_int_examples;
            CREATE SPACE IF NOT EXISTS basic_int_examples(partition_num=5, replica_factor=1, vid_type=int);
            USE basic_int_examples;
        wait: 10s
    after:
      - statements:
          - |
            UPDATE CONFIGS storage:wal_ttl=86400;
            UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };

log

log:
  level: INFO
  console: true
  files:
    - logs/nebula-importer.log

sources

sources configures the list of data sources. Each data source entry contains the data source information, the data processing options, and the schema mapping.

The following are the relevant configuration items.
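
Each source is located by one of `path`, `s3`, `oss`, `ftp`, `sftp`, `hdfs`, or `gcs`. A small pre-flight check can catch entries that set none or several of them. This sketch assumes that exactly one location key per source is intended, which this README does not state explicitly; `location_key` is a hypothetical helper that works on an already-parsed configuration dictionary.

```python
LOCATION_KEYS = ("path", "s3", "oss", "ftp", "sftp", "hdfs", "gcs")


def location_key(source: dict) -> str:
    """Return the single location key of a source entry, or raise ValueError."""
    # Only top-level keys are inspected; the nested `path` inside `ftp`,
    # `sftp`, or `hdfs` does not count as a local-file source.
    found = [key for key in LOCATION_KEYS if key in source]
    if len(found) != 1:
        raise ValueError(f"expected exactly one of {LOCATION_KEYS}, found {found}")
    return found[0]
```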

path

It only needs to be configured for local file data sources.

path: ./person.csv

s3

It only needs to be configured for s3 data sources.

s3:
  endpoint: <endpoint>
  region: <region>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>

oss

It only needs to be configured for oss data sources.

oss:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>

ftp

It only needs to be configured for ftp data sources.

ftp:
  host: 192.168.0.10
  port: 21
  user: <user>
  password: <password>
  path: <path of file>

sftp

It only needs to be configured for sftp data sources.

sftp:
  host: 192.168.0.10
  port: 22
  user: <user>
  password: <password>
  keyFile: <keyFile>
  keyData: <keyData>
  passphrase: <passphrase>
  path: <path of file>

hdfs

It only needs to be configured for hdfs data sources.

hdfs:
  address: 192.168.0.10:8020
  user: <user>
  servicePrincipalName: <Kerberos Service Principal Name>
  krb5ConfigFile: <Kerberos config file>
  ccacheFile: <Kerberos ccache file>
  keyTabFile: <Kerberos keytab file>
  password: <Kerberos password>
  dataTransferProtection: <Kerberos Data Transfer Protection>
  disablePAFXFAST: false
  path: <path of file>

gcs

It only needs to be configured for gcs data sources.

gcs:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  credentialsFile: <Service account or refresh token JSON credentials file>
  credentialsJSON: <Service account or refresh token JSON credentials>
  withoutAuthentication: <false | true>

batch

batch: 256
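
The manager section also sets `batch` (128 in the example above), so a source-level value presumably overrides it for that source. The sketch below illustrates that assumed precedence and what batching amounts to, namely grouping records before they are sent. `effective_batch` and `chunk` are hypothetical names, not the importer's API.

```python
from typing import Iterator, Optional, Sequence


def effective_batch(manager_batch: int, source_batch: Optional[int]) -> int:
    # Assumption: a source-level `batch` takes priority over the manager-level one.
    return source_batch if source_batch is not None else manager_batch


def chunk(records: Sequence[list], size: int) -> Iterator[list]:
    # Yield consecutive groups of at most `size` records.
    for start in range(0, len(records), size):
        yield list(records[start:start + size])
```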

csv

csv:
  delimiter: ","
  withHeader: false
  lazyQuotes: false
  comment: ""

tags

tags:
- name: Person
  mode: INSERT
  filter:
    expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
  id:
    type: "STRING"
    function: "hash"
    index: 0
  ignoreExistedIndex: true
  props:
    - name: "firstName"
      type: "STRING"
      index: 1
    - name: "lastName"
      type: "STRING"
      index: 2
    - name: "gender"
      type: "STRING"
      index: 3
      nullable: true
      defaultValue: male
    - name: "birthday"
      type: "DATE"
      index: 4
      nullable: true
      nullValue: _NULL_
    - name: "creationDate"
      type: "DATETIME"
      index: 5
    - name: "locationIP"
      type: "STRING"
      index: 6
    - name: "browserUsed"
      type: "STRING"
      index: 7
      nullable: true
      alternativeIndices:
        - 6
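
The `nullable`, `nullValue`, `alternativeIndices`, and `defaultValue` options interact. The sketch below shows one reading of that interaction, under assumptions this README does not spell out: the last three options apply only when `nullable` is true, `nullValue` defaults to an empty string, columns are tried in the order `index` then `alternativeIndices`, and `defaultValue` is used only when every tried column equals `nullValue`. `resolve_prop` is a hypothetical function, not part of the importer; see the Configuration Reference for the authoritative behavior.

```python
from typing import Optional


def resolve_prop(record: list[str], prop: dict) -> Optional[str]:
    """Return the raw string for a property, or None to represent NULL."""
    value = record[prop["index"]]
    if not prop.get("nullable", False):
        return value  # non-nullable: the column value is used as-is
    null_value = prop.get("nullValue", "")
    # Try the primary column first, then each alternative in order.
    for idx in [prop["index"], *prop.get("alternativeIndices", [])]:
        if record[idx] != null_value:
            return record[idx]
    # Every candidate matched the null marker: use the default if set, else NULL.
    return prop.get("defaultValue")
```

With the `browserUsed` property above, an empty column 7 would fall back to column 6 under these assumptions.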

# concatItems examples
tags:
- name: Person
  id:
    type: "STRING"
    concatItems:
      - "abc"
      - 1
    function: hash
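
In the `concatItems` example, the list mixes a string and an integer. A plausible reading, stated here as an assumption, is that strings are literal constants and integers are column indices, and that the items are joined in order to form the ID before `function` is applied. `concat_id` is a hypothetical name used only for illustration.

```python
from typing import Union


def concat_id(record: list[str], items: list[Union[str, int]]) -> str:
    # Assumption: str items are literal constants, int items are column indices.
    parts = [record[item] if isinstance(item, int) else item for item in items]
    return "".join(parts)
```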

edges

edges:
- name: KNOWS
  mode: INSERT
  filter:
    expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
  src:
    id:
      type: "INT"
      index: 0
  dst:
    id:
      type: "INT"
      index: 1
  rank:
    index: 0
  ignoreExistedIndex: true
  props:
    - name: "creationDate"
      type: "DATETIME"
      index: 2
      nullable: true
      nullValue: _NULL_
      defaultValue: 0000-00-00T00:00:00
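
To make the index-based mapping concrete, the sketch below applies the `src`, `dst`, `rank`, and `props` indices from the example to one CSV record. It assumes zero-based column indices (consistent with `index: 0` above) and leaves type conversion and filtering to the importer; `map_edge` is a hypothetical helper.

```python
def map_edge(record: list[str], edge: dict) -> dict:
    """Pick the configured columns out of one record (values stay strings)."""
    mapped = {
        "src": record[edge["src"]["id"]["index"]],
        "dst": record[edge["dst"]["id"]["index"]],
        "props": {p["name"]: record[p["index"]] for p in edge.get("props", [])},
    }
    if "rank" in edge:  # rank is included only when it is configured
        mapped["rank"] = record[edge["rank"]["index"]]
    return mapped
```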

See the Configuration Reference for details on the configurations.