Illustration example on 'Configure split' procedure in ML link prediction pipeline

meg261995 commented 1 year ago

Hi All,

I need some help understanding what exactly happens in the 'configure split' procedure of the ML link prediction pipeline. All the other steps are neat and clear except for the 'Configuring the relationship split' part. I am having real difficulty understanding how it works without an illustration on an example graph. Example code is given, but there is no explanation. It would be really helpful if someone could give an example or a reference.

I am getting the error below while training the Link Prediction pipeline.

Failed to invoke procedure gds.beta.pipeline.linkPrediction.train: Caused by: java.lang.IllegalArgumentException: The relationship types ['_TEST_', '_TEST_COMPLEMENT_'] are in the input graph, but are reserved for splitting.

To understand this error, I need a complete picture of how exactly the splitting happens. The explanation in the docs is okay, but it needs an example.

adamnsch commented 1 year ago

Hi @meg261995,

This is a very reasonable request, as the LP dataset splitting is fairly complex. We've actually been meaning to add an example for it (like we have done for node classification); I think we have simply forgotten.

We will try to do this as soon as we can and get back to you.

Thanks for the feedback, Adam

meg261995 commented 1 year ago

Awesome. I am working on a real-time project and this is a dependency I currently have. Hope to see the example soon. Thank you :)

adamnsch commented 1 year ago

> Awesome. I am working on a real-time project and this is a dependency I currently have. Hope to see the example soon. Thank you :)

I see! Well, for now you could check out the example for the split relationships auxiliary procedure. The example is not great, but the logic is the same as in the LP pipelines, so it might be of use to you.

meg261995 commented 1 year ago

Sure, thanks

Mats-SX commented 1 year ago

@meg261995 Just to comment on the error message: the relationship types _TEST_ and _TEST_COMPLEMENT_ must not exist in the graph that you are using for training your pipeline. Why not? Because during the splitting phase, the pipeline executor separates (splits) the relationships into several groups, each of which is given a relationship type. These types are hard-coded, and since they are added to the graph for the duration of the train procedure, nothing with the same names may already be there.
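To make the splitting and the name clash concrete, here is a rough Python sketch. It is a toy model, not the GDS implementation: the two-stage split (test vs. test complement, then train vs. feature input), the four type names, the fraction parameters, and the function name are assumptions for illustration, and negative sampling is left out entirely.

```python
import random

# Assumed reserved names; check the error message of your GDS version
# for the exact spelling.
RESERVED_TYPES = {"_TEST_", "_TEST_COMPLEMENT_", "_TRAIN_", "_FEATURE_INPUT_"}


def split_relationships(rels_by_type, test_fraction=0.2, train_fraction=0.5, seed=42):
    """Toy split. rels_by_type maps a type name to a list of (source, target) pairs."""
    # The check behind the error: reserved names must not already be in use.
    clash = RESERVED_TYPES & set(rels_by_type)
    if clash:
        raise ValueError(
            f"The relationship types {sorted(clash)} are in the input graph, "
            "but are reserved for splitting."
        )

    # Pool all relationships and shuffle them reproducibly.
    rels = [rel for pairs in rels_by_type.values() for rel in pairs]
    random.Random(seed).shuffle(rels)

    # Stage 1: hold out a test set; the rest is the test complement.
    n_test = int(len(rels) * test_fraction)
    test, complement = rels[:n_test], rels[n_test:]

    # Stage 2: split the test complement into train and feature-input sets.
    n_train = int(len(complement) * train_fraction)
    train, feature_input = complement[:n_train], complement[n_train:]

    return {
        "_TEST_": test,
        "_TEST_COMPLEMENT_": complement,
        "_TRAIN_": train,
        "_FEATURE_INPUT_": feature_input,
    }


graph = {"KNOWS": [(i, i + 1) for i in range(10)]}
parts = split_relationships(graph)
print({name: len(rels) for name, rels in parts.items()})
# {'_TEST_': 2, '_TEST_COMPLEMENT_': 8, '_TRAIN_': 4, '_FEATURE_INPUT_': 4}
```

Calling the function on a graph that already contains one of the reserved names raises the same kind of error as the one reported above.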

Normally, this shouldn't be a hindrance; we don't expect users to have relationship types with these names most of the time. If you do have relationships with these types, you must rename them in your projection to something else, like TEST-original or something.
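As a sketch of the renaming idea, the helper below builds a relationship projection map in which clashing database types are projected under a different name. The helper name, the suffix, and the list of reserved names are hypothetical; the resulting map follows the `{projectedType: {type: databaseType}}` shape of a native projection's relationship projection, which you would pass along when projecting the graph.

```python
def safe_relationship_projection(db_types, reserved, suffix="_ORIGINAL"):
    """Map each database relationship type to a projected name that avoids reserved names."""
    projection = {}
    for db_type in db_types:
        if db_type in reserved:
            # Project the clashing type under a new name, e.g. _TEST_ -> TEST_ORIGINAL.
            alias = db_type.strip("_") + suffix
        else:
            alias = db_type
        projection[alias] = {"type": db_type}
    return projection


# Assumed reserved names; adjust to whatever your error message lists.
reserved = {"_TEST_", "_TEST_COMPLEMENT_"}
print(safe_relationship_projection(["KNOWS", "_TEST_"], reserved))
# {'KNOWS': {'type': 'KNOWS'}, 'TEST_ORIGINAL': {'type': '_TEST_'}}
```

The database itself is left untouched; only the in-memory graph sees the new type name.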

The error may also be caused by running multiple pipelines at the same time on the same graph. One pipeline's splitting will then conflict with another's; this is not supported. Only one pipeline should run on a graph at a time. You can use gds.beta.listProgress() to inspect which operations are currently running.

Hope this helps.

meg261995 commented 1 year ago

Thanks

FlorentinD commented 1 year ago

Closing this issue, as we added an illustrated example at https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/config/#_example_4

neo4j / graph-data-science

Illustration example on 'Configure split' procedure in ML link prediction pipeline #216