stefanofontanelli / ColanderAlchemy

ColanderAlchemy helps you autogenerate Colander schemas based on SQLAlchemy mapped objects.

Add parameter to limit depth of recursion across relationships #97

Open grant-humphries opened 7 years ago

grant-humphries commented 7 years ago

Hi folks, I'm attempting to upgrade from version 0.3.1 to 0.3.3. The recursion across all relationships to create schema nodes for more distantly related tables (as opposed to just primary relationships) is one of the features leading me to upgrade, but the ORM I'm working with is pretty massive. When calling SQLAlchemySchemaNode against one of our tables, I let the code run for a minute or so and the add_nodes method was called just under 41,000 times. I let it run for several minutes beyond that, but the task never completed and Python was consuming a good chunk of memory the whole time.

In light of this, I propose adding a recursion depth parameter that limits how many tables away from the root a relationship can be and still be added to the schema. It may also be a good idea to have a default value in place for this, to avoid situations like the one I describe.
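The proposed parameter could be sketched in plain Python like this. Note this is an illustration only: the `max_depth` name, the `build_schema` walker, and the dict-based model graph are all assumptions, not ColanderAlchemy's actual API.

```python
# Sketch of depth-limited recursion over relationships, using a
# simplified model graph. `max_depth` and the graph shape are hypothetical.
def build_schema(model, graph, max_depth=3, _depth=0):
    """Return a nested dict of schema nodes, pruning relationships
    that are more than `max_depth` levels from the root."""
    node = {"name": model, "children": []}
    if _depth < max_depth:
        for related in graph.get(model, []):
            node["children"].append(
                build_schema(related, graph, max_depth, _depth + 1))
    return node

# Example: A -> B -> C -> D; with max_depth=2 the walk stops at C,
# so D is never expanded into a schema node.
graph = {"A": ["B"], "B": ["C"], "C": ["D"]}
schema = build_schema("A", graph, max_depth=2)
```

A sensible default (3 here) keeps the common case bounded while still letting callers raise the limit when they genuinely need deeper nesting.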

tisdall commented 7 years ago

Ideally, the best solution would be for it to stop when it reaches a node it has already visited. However, that might be complicated to implement, as it requires keeping track of all visited nodes.

A recursion depth count would stop it running infinitely, but the end result coming from the method would probably be unusable. It'd probably be good to have it throw an exception if the recursion goes beyond a given depth. (This would probably be good to have either way.)

I haven't worked on this project in years, though, so I'm not likely the best person to implement this. A recursion depth counter shouldn't be too hard to implement and I'd accept a PR for that if it had tests and docs.
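The exception-on-overflow behaviour described above could look something like the following. The names (`RecursionDepthExceeded`, `walk`, the dict-based graph) are illustrative, not part of the library.

```python
class RecursionDepthExceeded(Exception):
    """Raised when the relationship graph is deeper than allowed."""

def walk(model, graph, max_depth, _depth=0):
    # Raise rather than silently truncate, so the caller never receives
    # a schema that is missing levels without knowing it.
    if _depth > max_depth:
        raise RecursionDepthExceeded(model)
    for related in graph.get(model, []):
        walk(related, graph, max_depth, _depth + 1)

graph = {"A": ["B"], "B": ["C"]}
walk("A", graph, max_depth=2)   # fine: the deepest node is 2 levels down
try:
    walk("A", graph, max_depth=1)
    raised = False
except RecursionDepthExceeded:
    raised = True
```

The key design choice is failing loudly: a truncated schema looks valid but would lose data on round-trips, whereas an exception forces the implementer to choose a depth deliberately.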

grant-humphries commented 7 years ago

Thanks for taking the time to respond to this @tisdall. This package is pretty useful to my company so I might be able to convince them to allow me some time to create a pull request for this. A couple of responses to your comments:

Ideally, the best solution would be for it to stop when it reaches a node it has already visited. However, that might be complicated to implement, as it requires keeping track of all visited nodes.

It seems like a straightforward solution would be to add the SQLAlchemy class names to a set after each node visit. Since sets use hash tables, the lookups should still be pretty quick even if the ORM is really large. Does anything come to mind offhand that would circumvent this approach? It looks like something similar has been done to prevent circular dependencies with the parent_ parameter.
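The set-based idea could be sketched like this (a simplified stand-in, not ColanderAlchemy's real traversal; the names and dict-based graph are hypothetical):

```python
# Track class names already expanded, so each model is visited at most
# once per walk. Set membership tests are O(1) even for very large ORMs.
def build_schema(model, graph, visited=None):
    visited = set() if visited is None else visited
    visited.add(model)
    children = [
        build_schema(related, graph, visited)
        for related in graph.get(model, [])
        if related not in visited
    ]
    return {"name": model, "children": children}

# A <-> B form a cycle; the visited set stops the infinite recursion,
# so B is expanded once and its back-reference to A is skipped.
graph = {"A": ["B"], "B": ["A"]}
schema = build_schema("A", graph)
```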

A recursion depth count would stop it running infinitely, but the end result coming from the method would probably be unusable.

Can you elaborate on why the schema may be unusable in those cases? It seems like it would be possible to stop adding relationships once they are more than x levels away from the root table.

In our case I can't imagine a scenario where we'd want tables more than three or four levels from the root added to the schema, so it would still be nice to be able to prevent the creation of large parts of the schema that are never intended to be used.

tisdall commented 7 years ago

@grant-humphries - What I meant was that if you had data/schema that went 8 layers deep but returned only 4, the result wouldn't be very usable, as it's missing data; if you used it to edit the data, the end result would save only 4 layers and truncate the rest. So I think it should be an exception, leaving it up to the implementer to deal with as they see fit. In other words, objectify(dictify(x)) should always return x if there's no exception.

In your specific example, you'd set the recursion limit to 4, since you expect nothing to be deeper than that. Anything that went deeper would raise an exception.

tisdall commented 7 years ago

It's been a while since I used SQLAlchemy so I can't remember how it handles some things... I don't think you can just look at class names as you could have a table that has foreign keys to itself. I think you'd have to look at the table and the primary keys to identify if it's the exact same row you've already visited.
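The self-reference concern can be shown with the class-name approach from earlier in the thread. In this hypothetical sketch, a model that relates to its own class (think Employee.manager pointing back at Employee) gets pruned at the very first level, even when one or two levels of self-nesting would be legitimate:

```python
# Hypothetical self-referential model graph: Employee relates to Employee.
graph = {"Employee": ["Employee"]}

def build_schema(model, graph, visited=None):
    visited = set() if visited is None else visited
    visited.add(model)
    children = [build_schema(r, graph, visited)
                for r in graph.get(model, []) if r not in visited]
    return {"name": model, "children": children}

schema = build_schema("Employee", graph)
# The self-relationship is dropped immediately because "Employee" is
# already in the visited set, so the schema has no nested node at all.
```

This is why identifying nodes by class name alone may be too coarse; a finer-grained key (or a depth allowance for self-references) would be needed.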

maparent commented 7 years ago

This may sound simple-minded, but what about immediately setting the __colanderalchemy__ property of the class in the SQLAlchemySchemaNode initializer? That node is not complete yet, but it is already a valid reference. Then, when we get the relationship _class in get_schema_from_relationship (schema.py:460), we can check if the attribute is set and return it directly if so. The parents_ attribute will not be valid with this approach, but it does not seem to be used except to (unsuccessfully) avoid the loop.

tisdall commented 7 years ago

@grant-humphries - can you provide a failing example that we can run tests against? Take a look at the 'tests' directory here for how the tests work and see if you can modify one to mimic your data structure and hopefully replicate the problem.

Joishi commented 6 years ago

@tisdall - I'll take a look, as I am encountering the same issue

Joishi commented 6 years ago

@tisdall - Actually, looking at my models and your tests and doing some debugging, I'm not noticing any circular recursion issues. What I am noticing is that it recurses as deeply down the stack as necessary to create a full schema. My ORM contains about 400 models, many of which are intertwined. Creating a full schema for one model ends up creating over 250k schemas (many of which are duplicates, just at different levels in different relationships).

I have added some additional models to the tests and added one more test to allow for includes to support dot notation. I will likely submit a pull request this week for review.

tisdall commented 6 years ago

@Joishi - How are you planning on handling the issue in the PR?

Joishi commented 6 years ago

I'm not sure what you mean? This is the first time I've worked on a shared project on GitHub, so I'm not familiar with the expected workflow. I have forked the repo and made some local changes (not pushed yet) and was going to submit a pull request from my fork. Or maybe I'm not understanding what you're asking. :)

tisdall commented 6 years ago

@Joishi - well, are you adding a parameter to limit the recursion, or are you implementing something else?

Joishi commented 6 years ago

@tisdall - I'm going to add support for dot notation in includes. For example:

includes = ['col1', 'rel1', 'rel1.col1']

This will generate a schema that contains what you specify in the includes, and only those values. If you don't specify an includes, it will get everything by default (as it does now).

The above example would generate a schema that includes the relationship rel1, but only the col1 of that relationship. Any relationships THAT object has will not be traversed.
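The filtering described above could be sketched like this. The helper name and exact-match semantics are assumptions for illustration, not necessarily what the PR implements:

```python
# Sketch of dot-notation include filtering: a dotted attribute path from
# the root model is kept only if it appears in `includes` exactly.
def is_included(path, includes):
    return path in includes

includes = ['col1', 'rel1', 'rel1.col1']

# 'rel1.col1' is kept, but 'rel1.rel2' is not, so relationships hanging
# off rel1 are never traversed; 'col2' is likewise excluded.
kept = [p for p in ['col1', 'col2', 'rel1', 'rel1.col1', 'rel1.rel2']
        if is_included(p, includes)]
```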

tisdall commented 6 years ago

@Joishi - Just be aware that I haven't touched anything related to this library for quite some time, so I can't guarantee that it'll be merged any time soon... I'll likely need some input from some other users before going ahead with a merge.

Joishi commented 6 years ago

@tisdall - I submitted the pull request but it failed on 2.6 support. I'm guessing it has to do with the list comprehension I used. I will review what 2.6 support requires.

Oh ... or not. This is in the build log:

Exception: SQLAlchemy requires Python 2.7 or higher.

tisdall commented 6 years ago

Yeah, SQLA dropped support for Python 2.6 so we probably should too.