mahmoud / glom

☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️
https://glom.readthedocs.io
Other
1.88k stars 61 forks source link

Inconsistent Delete Behavior #246

Open ESattler opened 1 year ago

ESattler commented 1 year ago

Version: 22.1.0

I noticed in my larger script inconsistent behavior where it was throwing an error deleting when it shouldn't, so I recreated a smaller example:

from collections import OrderedDict

test = OrderedDict([
    ('name', 'dog'), 
    ('test', '12345'), 
    ('inputs', OrderedDict([
        ('animal', 'cat'), 
        ('test-value', 'test'),
    ])), 
    ('test2', [OrderedDict([
        ('name', 'dog2')
    ])])
])

import pprint

pprint.pprint(test)

import glom

print(glom.glom(test, "inputs.animal"))
glom.delete(test, "inputs.animal")

pprint.pprint(test)

If I run this over and over, sometimes it works and sometimes it doesn't and throws an error

glom.mutation.PathDeleteError: error raised while processing, details below.
 Target-spec trace (most recent last):
 - Target: OrderedDict([('name', 'dog'), ('test', '12345'), ('inputs', OrderedDict([('animal', 'c... (len=4)
 - Spec: Delete(Path('inputs', 'animal'))
 - Spec: Path('inputs')
glom.mutation.PathDeleteError: could not delete 'animal' on object at Path('inputs'), got error: AttributeError('animal')

Which doesn't make any sense. it successfully prints out what is at inputs.animal but fails then to delete that path and it's inconsistent. Out of 10 runs in a row, 3 of them failed for deleting.

I tried version 20.11.0 too out of curiosity and it didn't seem to change anything.

ESattler commented 1 year ago

Digging into it deeper, I've figured out that when it fails, it's because here _delete returns a <built-in function delattr> but when it passes, it is <built-in function delitem>

kurtbrose commented 1 year ago

I suspect what is happening here is that delete is mutating the dict, and the second time you run it, it fails.

Here's a quick example script, you can see as long as we use a fresh input it works, but if we re-use the same data structure twice we get that exact error.

from collections import OrderedDict
import glom

def build():
   return OrderedDict([
      ('name', 'dog'), 
      ('test', '12345'), 
      ('inputs', OrderedDict([
          ('animal', 'cat'), 
          ('test-value', 'test'),
      ])), 
      ('test2', [OrderedDict([
          ('name', 'dog2')
      ])])
   ])

glom.delete(build(), "inputs.animal")
glom.delete(build(), "inputs.animal")
glom.delete(build(), "inputs.animal")

test = build()
glom.delete(test, "inputs.animal")
glom.delete(test, "inputs.animal")

You mentioned this was a minimization of an error in a more complex case. Maybe the original case is different?

Also, does delete destructively mutating its input not what you expected? Were you expecting it to return a mutated copy?

ESattler commented 1 year ago

@kurtbrose So in my actual use case we are reading in a YAML file and modifying it with glom to delete stuff and then dumping it back. I noticed sometimes it would update and sometimes it wouldn't and it's because I had ignore missing set to true which was masking the error of it not finding the element even though it existed.

In my script example above, I was actually just straight running that exact script over and over and sometimes it would fail and sometimes it wouldn't but as you can see, I'm not using the same dictionary. Start of the script we initialize test and then perform actions on it. So maybe it could be a memory issue with Python? But I am fresh running those exact lines and seeing it inconsistently work and fail randomly.

Out of curiosity, did you run my example above? I'm curious if you can replicate it. I actually decided to try this on my home computer running Windows and using the Windows Subsystem for Linux and I can't get it to fail. When it was failing before, it was on a Macbook and the other people who tested it for me were also running Mac, so it makes me wonder if some weird memory or caching issue with Mac and running the script.

mahmoud commented 1 year ago

Hey! So I tried this back in November and had no luck, but now on 3.10 I think I've repro'd the behavior! It's super late here, and the issue is super funky, so forgive these rough notes.

First, a bit about my repro. When the script above is run with python -m pdb, it succeeds. Without the -m pdb it fails. Adding unconditional prints in glom core, same effect; remove the prints, and the failure came back. Always on the second try though, which makes me suspect there's something ugly in __pycache__. Some sort of import side effect? Likely something to do with dict/set random seed (does pdb use a consistent seed?). Very wild!

Next: Quick fix. Explicitly register the OrderedDict type here. We'd been relying on dict functionality to pick it up, but clearly that doesn't always work as intended. Why does this work?

At the moment, when glom encounters a type that it doesn't recognize, it tries to find the closest type it knows about. Then it attempts to use the operation handler associated with that. This has two implications:

  1. Unrelated abstract data types can win this "closest type" check. Here, ObjStyleKeys was being declared the closest type to OrderedDict, due to some nondeterminism somewhere.

More importantly:

  1. Autodiscover functions are only run against explicitly registered types. Delete's autodiscover reaches the correct conclusion about OrderedDict. Might need to rethink this, especially now that we have op type caching.

Anyways, super interesting bug, thanks for reporting it and making such a great minimal case! This was a tough one to report and I'm really glad you did.

I'll push the quick fix for OD momentarily, for release with glom 23 this week. Deeper fix to follow. Thanks again!

kurtbrose commented 1 year ago

Oops, I see now I missed in the original report:

got error: AttributeError('animal')

That means it was trying to do __delattr__ on an attribute, not __delitem__ on a key.

A temporary work-around would be to use a T; that removes ambiguity and glom doesn't try to infer which operator is correct for the type.

glom.delete(test, T["inputs"]["animal"])