neo4j / neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
https://neo4j.com/developer/spark/
Apache License 2.0

If a node has already been created, you want to establish a relationship without creating a node, but duplicate nodes occur. #383

Closed MustangYun closed 2 years ago

MustangYun commented 3 years ago

Feature description (Mandatory)

The nodes were already created in the past, but when I add only a relationship, duplicate nodes are created again.

Considered alternatives


There is a table containing the customers' subscription information and a table containing information about the people each customer follows. The follow table references the subscription table: every followed ID must also exist as a customer ID in the customer subscription table.

Please add a merge function. The example Cypher I wrote in Neo4j is as follows:

```cypher
LOAD CSV WITH HEADERS FROM 'customer.csv' AS row
CREATE (c:Customer { custNo: row.ID /* ...(skip) */ })
```

```cypher
LOAD CSV WITH HEADERS FROM 'follow.csv' AS row
MATCH (a:Customer), (b:Customer)
WHERE a.custNo = row.CUST_NO AND b.custNo = row.followID
MERGE (a)<-[r:follow]-(b)
```

How this feature can improve the project?

It would be nice if a merge option were available from Scala so that no nodes are created, only relationships. Please support this in the Scala API.

utnaf commented 3 years ago

Hi @MustangYun, can you share the Apache Spark code that causes this issue? Which version of Spark, Scala and connector are you using?

Thank you

MustangYun commented 3 years ago
Hello, Davide Fantuzzi. This is my first time greeting you; I'm not good at English, so I use Google Translate. First of all, I was moved that you took the time to reply directly to the problem I raised on my GitHub board.

Here are the versions I'm currently using:

Release label: emr-6.3.0
Hadoop distribution: Amazon 3.2.1
Applications: Hue 4.9.0, Spark 3.1.1, Oozie 5.2.1, Zeppelin 0.9.0

To explain the data pipeline and situation I have configured: I read from Oracle (12.x) into AWS EMR. There I use Scala code and send each row to Neo4j (running in Docker) to create nodes. The data that should create only relationships, without creating nodes, is stored in S3 in advance, and then APOC's merge function creates the relationships without producing duplicate nodes.

Here is the code:
```scala
%spark
// Connection url and the table (can also be a query instead of a table)
val url = "***@***.***:1521/DBSCNAME"
//val table = "ENC_CUST_T"
val table6 = "INTERESTED_CUST_T"

// Load the tables into DataFrames; cast Decimal columns to int
val cust_t = spark.read.format("jdbc")
  .options(Map(
    "driver"  -> "oracle.jdbc.driver.OracleDriver",
    "url"     -> url,
    "dbtable" -> table3))
  .load.toDF()
  .withColumn("col1", col("col1").cast("int"))
  .withColumn("col2", col("col2").cast("int"))
  .withColumn("col3", col("col3").cast("int"))
  .withColumn("col4", col("col4").cast("int"))

val intrested_cust = spark.read.format("jdbc")
  .options(Map(
    "driver"  -> "oracle.jdbc.driver.OracleDriver",
    "url"     -> url,
    "dbtable" -> table6))
  .load.toDF()
```

```scala
%spark
cust_t
  .limit(1000)
  .write
  .format("org.neo4j.spark.DataSource")
  .mode("Overwrite")
  .option("url", "neo4jIP")
  .option("relationship", "INTERESTED_CUST_T")
  .option("relationship.save.strategy", "keys")
  .option("relationship.source.labels", ":Customer")
  .option("relationship.source.save.mode", "Overwrite")
  .option("relationship.source.node.keys", "CUST_NO")
  .option("relationship.target.labels", ":Customer")
  .option("relationship.target.node.keys", "interest_NO")
  .option("relationship.target.save.mode", "Overwrite")
  .save()
```

If you write code like this, duplicate nodes inevitably occur. Also, the Neo4j official guide has no code that creates only a relationship.

To sum up, my request is as follows:

1. If an existing node has the same id value, do not create a new node (no duplicate nodes).
2. Create relationships between the existing nodes (a merge-relationship feature).
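For what it's worth, the connector's documentation describes a `Match` node save mode for relationship writes, which matches existing nodes by their keys instead of merging or creating them, much like the MATCH ... MERGE Cypher in the original request. A minimal sketch under that assumption (the relationship type `FOLLOWS`, the Bolt URL, and the `columnName:propertyName` key mappings here are illustrative, not taken from the thread):

```scala
// Sketch only: assumes cust_t is the DataFrame loaded from Oracle and that
// :Customer nodes keyed by custNo already exist in Neo4j.
cust_t
  .limit(1000)
  .write
  .format("org.neo4j.spark.DataSource")
  .mode("Append")                                    // append relationships only
  .option("url", "bolt://neo4jIP:7687")              // hypothetical Bolt URL
  .option("relationship", "FOLLOWS")
  .option("relationship.save.strategy", "keys")
  .option("relationship.source.labels", ":Customer")
  .option("relationship.source.save.mode", "Match")  // MATCH source nodes, never create
  .option("relationship.source.node.keys", "CUST_NO:custNo")
  .option("relationship.target.labels", ":Customer")
  .option("relationship.target.save.mode", "Match")  // MATCH target nodes, never create
  .option("relationship.target.node.keys", "interest_NO:custNo")
  .save()
```

With `Match`, rows whose endpoints are not already present in the graph do not produce new nodes, which is the behaviour asked for in points 1 and 2 above.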
conker84 commented 3 years ago

Did you define constraints on Customer(CUST_NO) and Customer(interest_NO)?
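Without a uniqueness constraint, concurrent Spark partitions merging on the same key can race and produce duplicates. For reference, a uniqueness constraint on the key property (the `custNo` property name is taken from the Cypher earlier in the thread; syntax for Neo4j 3.5/4.x) might look like:

```cypher
CREATE CONSTRAINT ON (c:Customer) ASSERT c.custNo IS UNIQUE
```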

conker84 commented 3 years ago

@MustangYun ping :)

conker84 commented 2 years ago

Closed for lack of feedback; feel free to reopen it in the future.