neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Allow arbitrary integer class ids in NodeClassification #263

Closed: devineyfajr closed this issue 1 year ago

devineyfajr commented 1 year ago

Is your feature request related to a problem? Please describe.

Class ids (apparently) must be sequential integers starting with zero.

Describe the solution you would like

Users should be able to use any set of integers, possibly non-sequential, as class ids.

Describe alternatives you have considered

First, I can't find the requirement for sequential integers starting at 0 anywhere in the docs. Add it, or if it's already there, make it more prominent. It probably belongs under targetProperty here: https://neo4j.com/docs/graph-data-science/current/machine-learning/node-property-prediction/nodeclassification-pipelines/training/#_syntax. Second, emit a WARNING or ERROR if a user supplies non-conforming class ids.

Additional context

Running the attached script (testNC.txt) gives different results just by changing the class ids from [0,1,2] to something else, like [0,1,12]. Some results are included at the end of the script.

brs96 commented 1 year ago

Hi @devineyfajr ,

Thanks for reporting this issue. This restriction was not listed in the docs, and it is certainly desirable to support non-consecutive ids. We've pushed a fix that will come out in the next patch release. Until then, please use consecutive class ids starting from 0 if possible for the most reliable results.
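
For anyone on a version without the fix, the workaround of using consecutive class ids can be sketched as a plain remapping step applied before the labels are written to the database. This is only a minimal sketch in Python: the helper name build_class_id_mapping and the sample labels are hypothetical, and nothing here calls the GDS library itself.

```python
def build_class_id_mapping(class_ids):
    """Map arbitrary integer class ids to consecutive ids starting at 0.

    Hypothetical helper for illustration; it is not part of GDS.
    Returns (to_consecutive, to_original) lookup dictionaries.
    """
    distinct = sorted(set(class_ids))
    to_consecutive = {original: index for index, original in enumerate(distinct)}
    to_original = {index: original for original, index in to_consecutive.items()}
    return to_consecutive, to_original


# Example labels using the non-consecutive ids from the report.
original_labels = [0, 1, 12, 12, 0, 1]

to_consecutive, to_original = build_class_id_mapping(original_labels)

# Labels to store as the target property before training.
remapped_labels = [to_consecutive[label] for label in original_labels]

# After prediction, translate predicted ids back to the original ids.
restored_labels = [to_original[label] for label in remapped_labels]
```

Keeping the reverse dictionary around means predicted classes can be translated back to the original ids once training and prediction are done.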

Side note: with the new patch, is full reproducibility across multiple node classification trainings achievable? In short, no. Suppose that after a run you clean up the DB and reproject the same graph (maybe with the class values changed). The elementIds assigned by the Neo4j DB will be different, and the ordering of the nodes by their ids could change. Because the id ordering differs after graph projection, different nodes get put into the train, test, and validation fold splits, which can produce slightly different scores. However, for larger graphs with good features and embeddings, the scores from each training should be very close.
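
To illustrate the side note, here is a toy sketch in Python of why a fixed random seed alone does not pin down the split. It assumes a splitter that shuffles node positions rather than stable node keys; this is an assumption made for illustration, not the actual GDS splitting code, and split_by_position and the node keys are hypothetical.

```python
import random


def split_by_position(ordered_node_keys, test_size, seed):
    """Split nodes into train/test by shuffling positions with a fixed seed.

    Hypothetical helper: the shuffle depends only on the seed and the
    number of nodes, so which node lands in the test set depends on the
    order in which the nodes are presented.
    """
    positions = list(range(len(ordered_node_keys)))
    random.Random(seed).shuffle(positions)
    test = {ordered_node_keys[p] for p in positions[:test_size]}
    train = {ordered_node_keys[p] for p in positions[test_size:]}
    return train, test


nodes = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

# Same nodes, but the database handed out ids in a different order
# after the graph was dropped and projected again.
first_projection = nodes
second_projection = list(reversed(nodes))

train_1, test_1 = split_by_position(first_projection, test_size=3, seed=42)
train_2, test_2 = split_by_position(second_projection, test_size=3, seed=42)

# Same seed, same nodes, different test sets.
print(sorted(test_1), sorted(test_2))
```

Both runs use the same seed and the same ten nodes, yet the test sets differ because the position-to-node assignment changed, which is the effect described above.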

brs96 commented 1 year ago

Hi @devineyfajr ,

This feature is now implemented in the 2.3.3 release, under the changelog entry "Multiclass node classification compatible with non-consecutive class ids". Let us know if it fixes your problem. If you run into anything else, feel free to open a new issue.

Thanks!