mahmoud / glom

☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️
https://glom.readthedocs.io

Traverse glom #45

Open kurtbrose opened 6 years ago

kurtbrose commented 6 years ago

the job of a Traverse is to walk its target recursively and return an iterator over all of the bits (as in depth-first or breadth-first traversal) -- this could perhaps share some bits with TargetRegistry

this is very useful when combined with Check and Assign for a kind of pattern-matching strategy:

# not sure if Traverse even needs an argument or if it should just implicitly walk current target
# maybe the argument should specify what it iterates over:  just items, items + paths, etc
# ensure T.val >= 0
glom(target, (Traverse(T), (Check(T.val, validate=lambda t: t < 0), Assign('val', 0))))

if there was an un-traverse glom possible, that would be even more powerful; but in the absence of that, being able to do something to the items being traversed is still useful

the ultimate goal of this kind of approach is a useful meta-glom -- you can imagine transformations like "set all defaults to a unique marker object that stores the path" to debug why an output is coming as None

the ultimate, ultimate goal being useful glom-macros (glomacro?) and glom-compilation (glompilation?)
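A minimal sketch of the walk described above, in plain Python rather than as a glom spec. The names `children`, `traverse_depth_first` and `traverse_breadth_first` are made up for illustration and are not part of glom; the sketch only descends into mappings and non-string sequences, which is an assumption about what "all of the bits" should mean.

```python
from collections import deque
from collections.abc import Mapping, Sequence


def children(node):
    """Return the direct children of a container, or an empty list for a leaf."""
    if isinstance(node, Mapping):
        return list(node.values())
    if isinstance(node, Sequence) and not isinstance(node, (str, bytes, bytearray)):
        return list(node)
    return []


def traverse_depth_first(target):
    """Yield target and every nested element, parents before their children."""
    yield target
    for child in children(target):
        yield from traverse_depth_first(child)


def traverse_breadth_first(target):
    """Yield target and every nested element, one nesting level at a time."""
    queue = deque([target])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(children(node))


target = {'a': [1, {'val': -2}], 'b': 3}
depth_first = list(traverse_depth_first(target))
breadth_first = list(traverse_breadth_first(target))
```

Both generators visit the same six elements and differ only in order, so the choice between them could plausibly be an argument to Traverse.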

kurtbrose commented 5 years ago

This idea has evolved a bit -- call it PathEnumerate now, and its job is to dissect a target out into a list of (T, object) pairs.

e.g.

glom([ {'hello': 'world'}, {'goodbye': 'world'}], PathEnumerate())

would result in

[ (T[0], {'hello': 'world'}),
  (T[0]['hello'], 'world'),
  (T[1], {'goodbye': 'world'}),
  (T[1]['goodbye'], 'world')
]

again, the goal is to make glom-specs that mutate glom-specs possible by allowing reasonable specs that operate on an arbitrarily nested structure
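For illustration, here is a rough plain-Python approximation of what PathEnumerate would produce. It is a sketch, not glom code: `path_enumerate` is a hypothetical name, and paths are plain tuples of keys and indexes standing in for the `T[0]['hello']` style paths in the example above.

```python
from collections.abc import Mapping, Sequence


def path_enumerate(target, prefix=()):
    """Yield (path, value) for every element nested below target, depth-first.

    Paths are tuples of keys/indexes (an assumption of this sketch), so
    (0, 'hello') plays the role of T[0]['hello'].
    """
    if isinstance(target, Mapping):
        items = target.items()
    elif isinstance(target, Sequence) and not isinstance(target, (str, bytes, bytearray)):
        items = enumerate(target)
    else:
        return  # leaves have no children to enumerate
    for key, value in items:
        path = prefix + (key,)
        yield path, value
        yield from path_enumerate(value, path)


pairs = list(path_enumerate([{'hello': 'world'}, {'goodbye': 'world'}]))
```

The resulting list has the same four entries, in the same order, as the expected output above.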

roryhr commented 5 years ago

I needed this feature for my GDPR use case. Anywhere I find the key email I gotta remove it -- and it could be moving around and hiding!

paths = catalog(target)
# filter paths with regex like '.*email'
for path in paths:
    glom.assign(target, path, None)

mahmoud commented 5 years ago

Hey @roryhr! This feature is still coming to glom, but in the meantime you can do what Kurt and I do and use an earlier design, called remap: http://sedimental.org/remap.html#drop-empty-values

It's a bit trickier to use, but it's perfect for cases like yours (similar to the one linked above). Hope this helps!
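Until something like `catalog` exists, the email-scrubbing loop sketched above can be approximated with the standard library alone. This is a sketch under assumptions: `catalog`, `assign_at` and `scrub` are hypothetical helpers, paths are tuples of keys/indexes, and neither glom nor remap is used.

```python
import re
from collections.abc import Mapping, Sequence


def catalog(target, prefix=()):
    """Hypothetical helper: yield the path of every nested element, parents first."""
    if isinstance(target, Mapping):
        items = list(target.items())
    elif isinstance(target, Sequence) and not isinstance(target, (str, bytes, bytearray)):
        items = list(enumerate(target))
    else:
        return
    for key, value in items:
        path = prefix + (key,)
        yield path
        yield from catalog(value, path)


def assign_at(target, path, value):
    """Set the element at path by walking to its parent container."""
    parent = target
    for key in path[:-1]:
        parent = parent[key]
    parent[path[-1]] = value


def scrub(target, key_pattern, replacement=None):
    """Replace every value whose final string key matches key_pattern (mutates target)."""
    regex = re.compile(key_pattern)
    # Collect paths before mutating so the walk never sees a half-edited structure.
    matches = [p for p in catalog(target)
               if isinstance(p[-1], str) and regex.fullmatch(p[-1])]
    # Deepest paths first, so a replaced parent never hides a matching child.
    for path in reversed(matches):
        assign_at(target, path, replacement)
    return target


record = {
    'email': 'a@example.com',
    'profile': {'work_email': 'b@example.com', 'name': 'Ada'},
    'contacts': [{'email': 'c@example.com'}, {'phone': '555-0100'}],
}
scrub(record, r'.*email')
```

The sample `record` is invented; after the call every key ending in `email` maps to `None` and all other values are untouched.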

kurtbrose commented 4 years ago

https://www.w3schools.com/xml/xpath_syntax.asp

traverse should also support an XPath-like syntax to filter output (or, if not traverse, something that can be used with traverse very easily)

if the output of traverse is [(path, element)], then the output could be filtered with Match(path) -- however, wildcard is a bit trickier

in XPath, . is "current node" and * is "any number of nodes" -- I'd propose switching these to * and ... for glom, since I think these are more familiar to glom's audience from file system globbing and use of ... in python [] syntax

kurtbrose commented 4 years ago

another thing that XPath syntax makes a great deal of is "attributes" vs "path"

here's a good acid test for capability, I think: [image]

one way this could be expressed is

('0.bookstore.book', And(('price', M > 35), 'title'))

a bit more of a mouthful than

/bookstore/book[price>35.00]/title

kurtbrose commented 4 years ago

come to think of it... maybe what we want here is kind of a multi-fetch rather than a pure traverse

what if path supported a '*' syntax which switched it from returning a single result to an iterable of results?

outside of XML land, every node doesn't implicitly have multiple children that you can only refer to by type...

what if this

Path('bookstore.books.*')

was a short-hand for an iterable of results

('bookstore.books.*', [And(('price', M > 35), 'title') | SKIP])

maybe something like that?
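To make the intended result concrete, here is a plain-Python sketch of what a `'*'` segment could do. `fetch_each` is a hypothetical helper, the bookstore data is invented, and unlike the proposal the sketch always returns a list, even when no `'*'` is present.

```python
def fetch_each(target, dotted_path):
    """Follow dotted_path; a '*' segment fans out over every child of the current nodes.

    Simplification: non-'*' segments are looked up as dict keys only.
    """
    nodes = [target]
    for segment in dotted_path.split('.'):
        next_nodes = []
        for node in nodes:
            if segment == '*':
                # Fan out: dict values or sequence items all become current nodes.
                children = node.values() if isinstance(node, dict) else node
                next_nodes.extend(children)
            else:
                next_nodes.append(node[segment])
        nodes = next_nodes
    return nodes


# Invented sample data in the shape of the bookstore example.
target = {'bookstore': {'books': [
    {'title': 'Book A', 'price': 30.00},
    {'title': 'Book B', 'price': 49.99},
    {'title': 'Book C', 'price': 39.95},
]}}

# Rough equivalent of [And(('price', M > 35), 'title') | SKIP] applied to each book.
titles = [book['title']
          for book in fetch_each(target, 'bookstore.books.*')
          if book['price'] > 35]
```

The list comprehension plays the role of the `| SKIP` branch: books that fail the price check are simply left out.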

kurtbrose commented 4 years ago

then, '...' path segment would trigger a recursive walk

Path('a...b')  # return all 'b's at any level from 'a'

one challenge here is that now the path is unknown if e.g. you want to emit that; we could cover that by making S[Path] contain the actual path

then, the "plain" Traverse above would translate to Path('...')
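A recursive segment like the one in `Path('a...b')` could behave roughly as follows. This is a sketch only: `find_all` is a hypothetical name, and returning the path alongside each value is one possible answer to the "path is unknown" challenge, standing in for S[Path].

```python
from collections.abc import Mapping, Sequence


def find_all(target, key, prefix=()):
    """Yield (path, value) for every mapping entry named key at any depth below target."""
    if isinstance(target, Mapping):
        items = target.items()
    elif isinstance(target, Sequence) and not isinstance(target, (str, bytes, bytearray)):
        items = enumerate(target)
    else:
        return
    for k, value in items:
        path = prefix + (k,)
        # Only mapping keys count as names; list indexes are just passed through.
        if isinstance(target, Mapping) and k == key:
            yield path, value
        yield from find_all(value, key, path)


target = {'a': {'b': 1, 'c': {'b': 2, 'd': [{'b': 3}]}}, 'b': 4}

# All 'b's at any level from 'a'; the top-level 'b' is outside 'a' and is skipped.
found = list(find_all(target['a'], 'b', prefix=('a',)))
```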

kurtbrose commented 4 years ago

I guess "get all paths and values" would be ['...', Fill( (S[Path], T) )]

kurtbrose commented 4 years ago

another helper that would be super useful in case of e.g. the GDPR email thing would be Replace() -- assuming we get the invariant on S[Path] right, this would be equivalent to

Assign(Path(S[PARENT][T]) + S[Path], newval) 

or something like that -- on parent of current target, replace current target with new value

I guess the problem with leaning on S[Path] here is that it makes the resulting spec extremely context sensitive

maybe if there was a way to back out instead?

Path('...email..')

this would express, find any paths that go through an attribute named "email", then "back up" one level to the parent

(Path('...email..'), Assign('email', newval))

this would be, go to everywhere with email, then replace with newval

...if we allowed a mechanism for embedding regex...

(Path('...{.*email}..'), Assign(S[Path][-1], newval))

kurtbrose commented 4 years ago

so I really like that syntax as a top-level; but probably also want to make sure it decomposes into nice bits and Path doesn't just become super complicated and magical

kurtbrose commented 4 years ago

per discussion:

* and ** are probably better than * and ... (avoids colliding with . path demarcation)

some related:

https://github.com/mahmoud/glom/issues/89 -- solved by **

https://github.com/mahmoud/glom/issues/40 -- similar to GDPR use case above

https://github.com/mahmoud/glom/issues/39 -- not sure if this would address that, but there's a similar solution proposed of walk-with-path
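One way to read the `*` and `**` semantics is as glob matching over path segments. The sketch below assumes `*` matches exactly one segment and `**` matches zero or more; `path_matches` is a hypothetical helper working on tuples, not glom's Path.

```python
def path_matches(pattern, path):
    """Return True if path (a tuple of keys) matches pattern (a tuple of segments).

    Assumed semantics: '*' matches exactly one segment, '**' matches zero or
    more segments, and any other pattern segment must equal the path segment.
    """
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == '**':
        # Try letting '**' consume 0, 1, 2, ... leading segments of the path.
        return any(path_matches(rest, path[i:]) for i in range(len(path) + 1))
    if not path:
        return False
    if head == '*' or head == path[0]:
        return path_matches(rest, path[1:])
    return False


single_level = path_matches(('a', '*'), ('a', 'b'))
too_deep = path_matches(('a', '*'), ('a', 'b', 'c'))
any_email = path_matches(('**', 'email'), ('users', 0, 'email'))
```

A matcher like this could filter the (path, element) pairs from a traversal. One limitation of the sketch is that keys literally named `'*'` or `'**'` cannot be matched exactly.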

kurtbrose commented 4 years ago

what would the stand-alone names for * and ** be? Glob() and RGlob() (recursive-glob)?

kurtbrose commented 4 years ago

Traverse() and Reverse() (recursive traverse)

kurtbrose commented 4 years ago

maybe Tread() and Retread()?

kurtbrose commented 4 years ago

Iter() and DeepIter()?

kurtbrose commented 4 years ago

Every() and REvery()? All() and RAll()? Each() and Reach()?

kurtbrose commented 4 years ago

I kind of like Each() and Reach()

kurtbrose commented 4 years ago

https://github.com/mahmoud/glom/pull/144

GlenDC commented 3 years ago

Hello. Is the plan still to implement traverse at some point? Any help required for this?