neo4j-contrib / neomodel

An Object Graph Mapper (OGM) for the Neo4j graph database.
https://neomodel.readthedocs.io
MIT License
936 stars 231 forks source link

concurrent async calls of `get_or_create` lead to duplicate nodes #807

Open AnikinNN opened 3 months ago

AnikinNN commented 3 months ago

Expected Behavior (Mandatory)

At concurrent calling of get_or_create from different places of a program we get the same node which was created once

Actual Behavior (Mandatory)

At concurrent calling of get_or_create from different places of a program we can get different nodes which we consider as duplicates

How to Reproduce the Problem

Run the code in example. Probably you should increase the amount of created nodes if your's hardware better than mine. I wrapped code into for loop because sometimes the bug can not be caught on the first attempt. Looks like race condition.

Simple Example

As a simple alternative to concurrent calls from different places of entire program we can use asyncio.gather

import asyncio

import neomodel.config
from neomodel import AsyncStructuredNode, StringProperty

neomodel.config.DATABASE_URL = 'bolt://neo4j:12345678@neo4j:7687'
name = 'some_name'

class SomeNode(AsyncStructuredNode):
    name = StringProperty(required=True)

async def main():
    for counter in range(100):
        print(f'iteration {counter}')

        # cleanup
        await neomodel.adb.cypher_query("MATCH (n) DETACH DELETE n")

        # create a lot of the same nodes
        # this should return a lot of instances with the same id
        created_nodes = await asyncio.gather(*(
            SomeNode.get_or_create(dict(name=name))
            for _ in range(1000)
        ))

        # check that all instances have the same id
        ids = set(i[0].element_id_property for i in created_nodes)
        assert len(ids) == 1

        # check that there is only one node in the neo4j
        assert len(await SomeNode.nodes) == 1

if __name__ == '__main__':
    asyncio.run(main())

Specifications (Mandatory)

Currently used versions

Versions

-OS: linux, 3.10.14-slim -Library:5.3.0 [extras] -Neo4j:5.19.0 community

mariusconjeaud commented 3 months ago

Hmmmm... Yes, it makes sense the this would happen, as it is a kind of race condition. The thing is, get_or_create is intended for batch operations, which should not be expected to avoid such collisions. Batch operations can run parallel, but then the user has to ensure that a given object is not shared across different parallel batches.

So... I don't see a way around this. Do you have an idea ?