isagalaev / ijson

Iterative JSON parser with Pythonic interface
http://pypi.python.org/pypi/ijson/

parse context '*.item' is ambiguous #7

Closed jnothman closed 12 years ago

jnothman commented 12 years ago

One callback missing from yajl is an equivalent of yajl_map_key for arrays, i.e. yajl_array_item. The fact that ijson.parse() indicates the context has the potential to remedy situations where such a callback would be useful. However, the context is unreliable, due to the ambiguity when "item" is a key:

>>> pprint.pprint(list(ijson.parse(StringIO.StringIO('[{"item":"val"}]'))))
[('', 'start_array', None),
 ('item', 'start_map', None),
 ('item', 'map_key', 'item'),
 ('item.item', 'string', u'val'),
 ('item', 'end_map', None),
 ('', 'end_array', None)]

isagalaev commented 12 years ago

Though I can't speak for the developer of YAJL, I still don't think there's any need for such a callback: any value between start_array and end_array is an array item. Conveniently handling this context is outside the scope of YAJL itself.

I also can't see a situation with the ambiguity you mention. You usually know the structure of the JSON you're parsing. In other words, I don't see what's wrong with the output you quoted: if I need whole objects, I'll filter values by the "item" prefix; if I need the "item" key inside those objects, I'll look for "item.item". The only highly theoretical problem I could conceive of is having an array with both objects and arrays as items:

[
  {"item": "value"},
  ["other"]
]

Then item.item would mean different types of data. But this highly unlikely scenario is best handled by the application that has the problem, for example by adding custom markers to the context. I don't see how to solve it in a generic way.
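To make the mixed-array case concrete, here is a minimal Python 3 sketch. The `events` helper is hypothetical: it walks an already-decoded value and mimics the prefix scheme seen in the `ijson.parse()` output quoted earlier. It is not ijson's streaming implementation, and the event name for non-string scalars is simplified.

```python
import json


def events(value, prefix=""):
    """Yield (prefix, event, value) tuples in the style of the output quoted above.

    Hypothetical stand-in that works on a decoded value, not on a stream.
    """
    if isinstance(value, dict):
        yield prefix, "start_map", None
        for key, child in value.items():
            yield prefix, "map_key", key
            # A map key extends the prefix with the key itself.
            child_prefix = f"{prefix}.{key}" if prefix else key
            yield from events(child, child_prefix)
        yield prefix, "end_map", None
    elif isinstance(value, list):
        yield prefix, "start_array", None
        # Every array element extends the prefix with the literal word "item".
        child_prefix = f"{prefix}.item" if prefix else "item"
        for child in value:
            yield from events(child, child_prefix)
        yield prefix, "end_array", None
    elif isinstance(value, str):
        yield prefix, "string", value
    else:
        # Simplification: all other scalars share one event name here.
        yield prefix, "scalar", value


doc = json.loads('[{"item": "value"}, ["other"]]')
hits = [v for p, e, v in events(doc) if p == "item.item" and e == "string"]
print(hits)  # ['value', 'other']
```

Both strings arrive under the prefix `item.item`, although one is the value of the key `"item"` and the other is an element of a nested array.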

jnothman commented 12 years ago

No, there isn't necessarily a nice way to do it, though arguably, 0, 1, etc. might be more comparable to keys than item.

But "any value between start_array and end_array" requires keeping a stack (because what you meant is "any value locally between start_array and end_array"). yajl does not keep a stack, but ijson does.

isagalaev commented 12 years ago

0, 1, ... aren't free from the same problem you're describing: they're still ambiguous with an object having those keys. What's worse, though, is that it would break the main use case of going through array items given their common prefix:

import ijson

for item in ijson.items(file, 'item'):
    ...  # handle one element of the top-level array

You should have a common name for all the items for this to work.

About keeping a stack: yes, this is what I meant by saying that an application can do this. It can use the stack information that ijson provides and augment or modify it in whatever way it sees fit. I can't see a way to do this in the library itself, for all the reasons I mentioned: I don't see any practical problem with the current approach, I can't invent another approach without understanding a use case, and the only proposed solution, numbered items, simply doesn't work.
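As a rough illustration of application-side tracking, the sketch below keeps its own stack of open containers over a stream of `(prefix, event, value)` tuples and tags each event with the kind of its enclosing container. The name `with_parent_kind` and the hard-coded `EVENTS` list are assumptions made for this example; the tuples follow the shape of the output quoted at the top of the thread, written out by hand for the mixed array discussed above.

```python
def with_parent_kind(events):
    """Tag each event with the kind of its enclosing container.

    Yields (parent, prefix, event, value), where parent is "array", "map",
    or None at the top level. Hypothetical helper, not part of ijson.
    """
    stack = []  # kinds of the containers currently open
    for prefix, event, value in events:
        if event in ("end_array", "end_map"):
            stack.pop()  # the closing container belongs to its own parent
        parent = stack[-1] if stack else None
        yield parent, prefix, event, value
        if event == "start_array":
            stack.append("array")
        elif event == "start_map":
            stack.append("map")


# Hand-written events for: [{"item": "value"}, ["other"]]
EVENTS = [
    ("", "start_array", None),
    ("item", "start_map", None),
    ("item", "map_key", "item"),
    ("item.item", "string", "value"),
    ("item", "end_map", None),
    ("item", "start_array", None),
    ("item.item", "string", "other"),
    ("item", "end_array", None),
    ("", "end_array", None),
]

tagged = [
    (parent, prefix, value)
    for parent, prefix, event, value in with_parent_kind(EVENTS)
    if event == "string"
]
print(tagged)
# [('map', 'item.item', 'value'), ('array', 'item.item', 'other')]
```

The prefix is identical for both strings, but the extra marker tells the application which one is a genuine array element.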

jnothman commented 12 years ago

Yes, I'm aware it has the same ambiguity. Now that I see the intended use case (different from what I had considered), it's fair enough. I just thought the minor issue was worth pointing out.

Ideally you could return the stack of keys and offsets, rather than a joined string of them. Specifying that you'd like to iterate through all items would need a cleaner and more precise specification.
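A minimal sketch of what such a structured context might look like, assuming a path represented as a tuple in which string entries are map keys and integer entries are array offsets. The `path_events` helper is hypothetical and operates on an already-decoded value; it is not an ijson API, and scalar events are collapsed into a single name for brevity.

```python
import json


def path_events(value, path=()):
    """Yield (path, event, value), with path as a tuple of keys and offsets.

    str entries are map keys, int entries are array offsets, so the key "0"
    and the offset 0 stay distinguishable. Hypothetical helper, not ijson.
    """
    if isinstance(value, dict):
        yield path, "start_map", None
        for key, child in value.items():
            yield path, "map_key", key
            yield from path_events(child, path + (key,))
        yield path, "end_map", None
    elif isinstance(value, list):
        yield path, "start_array", None
        for index, child in enumerate(value):
            yield from path_events(child, path + (index,))
        yield path, "end_array", None
    else:
        yield path, "scalar", value  # simplification: one name for all scalars


def is_array_item(path):
    """True when the last path step is an array offset rather than a map key."""
    return bool(path) and isinstance(path[-1], int)


doc = json.loads('[{"item": "value"}, ["other"], {"0": "zero"}]')
scalars = [(p, v) for p, e, v in path_events(doc) if e == "scalar"]
print(scalars)
# [((0, 'item'), 'value'), ((1, 0), 'other'), ((2, '0'), 'zero')]
print([v for p, v in scalars if is_array_item(p)])  # ['other']
```

With this representation neither a key named `"item"` nor a key named `"0"` collides with an array position, at the cost of losing the single common prefix that makes `ijson.items(file, 'item')` convenient.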