unipop-graph / unipop

Data Integration Graph
Apache License 2.0

Does unipop support modeling single-table data as a 'virtual' graph? #111

Open sorryya opened 6 years ago

sorryya commented 6 years ago

For example, if an Elasticsearch document contains several vertices and edges, and each vertex or edge is represented by a set of fields from the document, how should the mapping file be written?

seanbarzilay commented 6 years ago

@sorryya Did you mean this kind of mapping: Inner Edges?

sorryya commented 6 years ago

I mean an Elasticsearch document like this:

{
    "_index": "xxx",
    "_type": "yyy",
    "_id": "AV-VSXTUbcKGrP6qekMg",
    "_source": {
        "field_1": "1111",
        "field_2": "2222",
        "field_3": "3333",
        "field_4": "4444",
        "field_5": "5555",
        "field_6": "6666",
        "field_7": "7777",
        "field_8": "8888",
        "field_9": "9999"
    }
}

My scenario:

  1. Each document represents an event, and I want to model a graph of the co-occurrence relations between the objects in the event.
  2. Some fields describe the event and belong to edges; other fields describe the objects and belong to vertices.
  3. So a single field may serve as an id or a property for several edges or vertices, and a field used as a vertex id may have duplicate values across documents.
  4. An "id" may be composed of a set of fields.
  5. The "index" should cover all or some of the indexes in Elasticsearch.

Can the mapping file look like this?

{
  "class": "org.unipop.elastic.ElasticSourceProvider",
  "clusterName": "escluster",
  "addresses": "http://localhost:9200",
  "edges": [
    {
      "index": "*",
      "id": {
        "fields": ["some_value", "@_id"],
        "delimiter": "+"
      },
      "label": "lable_e1",
      "properties": {
        "field_1": "@field_1",
        "field_2": "@field_2",
        "field_3": "@field_3"
      },
      "outVertex":{
        "ref": false,
        "id": "@field_4",
        "label": "lable_v1",
        "properties": {
          "field_5": "@field_5"
        }
      },
      "inVertex":{
        "ref": false,
        "id": {
          "fields": ["@field_6", "@field_7"],
          "delimiter": "+"
        },
        "label": "lable_v2",
        "properties": {
          "property_name": {
            "fields": ["@field_6", "@field_7"],
            "delimiter": "+"
          }
        }
      }
    },
    {
      "index": "*",
      "id": "@_id",
      "label": "lable_e2",
      "properties": {
        "field_1": "@field_1",
        "field_2": "@field_2",
        "field_5": "@field_4",
        "field_7": "@field_8"
      },
      "outVertex":{
        "ref": false,
        "id": "@field_4",
        "label": "lable_v3",
        "properties": {
          "field_5": "@field_5"
        }
      },
      "inVertex":{
        "ref": false,
        "id": "@field_9",
        "label": "lable_v4",
        "properties": {
          "field_9": "@field_9"
        }
      }
    }
  ]
}
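
To make the intent of the "id" entries above concrete, here is a minimal Python sketch of how I assume a mapping value is resolved against one document: a string starting with "@" is read from the document, any other string is a constant, and an object with "fields" and "delimiter" joins the resolved parts. The resolve function is hypothetical and only illustrates the semantics I expect; it is not unipop code.

```python
def resolve(template, source, doc_id=None):
    """Resolve one mapping value against a document (assumed semantics, not unipop's code)."""
    if isinstance(template, dict):
        # Composite value: resolve each part, then join with the delimiter.
        parts = [resolve(part, source, doc_id) for part in template["fields"]]
        return template["delimiter"].join(str(p) for p in parts)
    if isinstance(template, str) and template.startswith("@"):
        # "@name" is assumed to reference a document field; "@_id" is the document id.
        name = template[1:]
        return doc_id if name == "_id" else source.get(name)
    # Anything else is treated as a constant.
    return template


doc = {
    "_id": "AV-VSXTUbcKGrP6qekMg",
    "_source": {"field_4": "4444", "field_6": "6666", "field_7": "7777"},
}

edge_id = resolve({"fields": ["some_value", "@_id"], "delimiter": "+"},
                  doc["_source"], doc["_id"])
out_vertex_id = resolve("@field_4", doc["_source"], doc["_id"])
in_vertex_id = resolve({"fields": ["@field_6", "@field_7"], "delimiter": "+"},
                       doc["_source"], doc["_id"])

print(edge_id)        # some_value+AV-VSXTUbcKGrP6qekMg
print(out_vertex_id)  # 4444
print(in_vertex_id)   # 6666+7777
```

So for the sample document, the first edge mapping should produce one edge from vertex "4444" to vertex "6666+7777".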

With this mapping, I have run into the following problems:

  1. If I set "ref" to false in "outVertex" or "inVertex", it throws java.lang.NullPointerException.
  2. The number of edges returned is far lower than expected: g.E().count() returns 9881, but there are 7242721 Elasticsearch documents.
  3. If I define all vertices inside edges in the mapping file, g.V().count() returns 0.
  4. If I define the vertices independently (not inside edges), the number of vertices returned is also too low: g.V().values("field_4").count() returns 251, but the distinct count of field_4 (used as the vertex id and as a property) in Elasticsearch is 753.
  5. When I query with has(...), I get no results.
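
For reference, this is roughly how I derive the counts I expect, sketched in Python over a few made-up documents. It assumes that the second edge mapping ("lable_e2") yields one edge per document and that a vertex is identified by its label together with its id, so repeated ids collapse into one vertex. The expected_counts helper and the sample values are hypothetical.

```python
def expected_counts(docs):
    """Return (edge_count, vertex_count) expected for the "lable_e2" mapping.

    Assumes one edge per document (edge id is "@_id") and that vertices
    are deduplicated by (label, id).
    """
    edge_ids = set()
    vertex_keys = set()
    for doc in docs:
        source = doc["_source"]
        edge_ids.add(doc["_id"])
        vertex_keys.add(("lable_v3", source["field_4"]))  # outVertex id
        vertex_keys.add(("lable_v4", source["field_9"]))  # inVertex id
    return len(edge_ids), len(vertex_keys)


# Made-up sample: two documents share field_4, two share field_9.
docs = [
    {"_id": "d1", "_source": {"field_4": "A", "field_9": "X"}},
    {"_id": "d2", "_source": {"field_4": "A", "field_9": "Y"}},
    {"_id": "d3", "_source": {"field_4": "B", "field_9": "X"}},
]

edges, vertices = expected_counts(docs)
print(edges, vertices)  # 3 4
```

By the same reasoning I expect g.E().count() to match the document count and the number of "lable_v3" vertices to match the distinct count of field_4.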

seanbarzilay commented 6 years ago

@sorryya I haven't tested a schema where both vertices are non-reference vertices, so I will fix it and release a patch in the next few days.