mahmoud / glom

☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️
https://glom.readthedocs.io

How can I use glom to pick out zero, one, or more array elements whose key must match a value, and how do I load this spec from a file? #225

Open davesargrad opened 3 years ago

davesargrad commented 3 years ago

Part 1

My target looks like this:

{
   "stuff":"is_cool",
   "yay":1,
   "a_list":[
      {
         "SOURCE TYPE":"Something",
         "ID":"https://blizzard.com"
      },
      {
         "SOURCE TYPE":"Something Else",
         "ARBITRARY":"Milk",
         "notes":[
            {
               "label":"chocolate",
               "comments":[
                  "yummy",
                  "yummiest"
               ]
            },
            {
               "label":"strawberry",
               "comments":[
                  "pink stuff",
                  "yay"
               ]
            }
         ]
      },
      {
         "SOURCE TYPE":"Something Else Else",
         "ALGORITHM":"Cool"
      }
   ]
}

I want to form a spec that will output just the element(s) of "a_list" that contain "SOURCE TYPE": "Something Else".

So the output would be this:

[
   {
      "SOURCE TYPE":"Something Else",
      "ARBITRARY":"Milk",
      "notes":[
         {
            "label":"chocolate",
            "comments":[
               "yummy",
               "yummiest"
            ]
         },
         {
            "label":"strawberry",
            "comments":[
               "pink stuff",
               "yay"
            ]
         }
      ]
   }
]

Keep in mind that this is also a valid target (two matching "a_list" array elements, so the output would be a list of length 2):

{
   "stuff":"is_cool",
   "yay":1,
   "a_list":[
      {
         "SOURCE TYPE":"Something",
         "ID":"https://blizzard.com"
      },
      {
         "SOURCE TYPE":"Something Else",
         "ARBITRARY":"Milk",
         "notes":[
            {
               "label":"chocolate",
               "comments":[
                  "yummy",
                  "yummiest"
               ]
            },
            {
               "label":"strawberry",
               "comments":[
                  "pink stuff",
                  "yay"
               ]
            }
         ]
      },
      {
         "SOURCE TYPE":"Something Else",
         "ARBITRARY":"Soda",
         "notes":[
            {
               "label":"berry",
               "comments":[
                  "gross",
                  "yummierest"
               ]
            },
            {
               "label":"cherry",
               "comments":[
                  "soda is good stuff",
                  "cherry is like berry, but with a c"
               ]
            }
         ]
      },
      {
         "SOURCE TYPE":"Something Else Else",
         "ALGORITHM":"Cool"
      }
   ]
}

Is there a spec that can do this?

Part 2

Here, I want to preserve a portion of the higher-level object as well.

The target is the same, but I want the following output:

{
   "stuff":"is_cool",
   "yay":1,
   "a_list":[
      {
         "SOURCE TYPE":"Something Else",
         "ARBITRARY":"Milk",
         "notes":[
            {
               "label":"chocolate",
               "comments":[
                  "yummy",
                  "yummiest"
               ]
            },
            {
               "label":"strawberry",
               "comments":[
                  "pink stuff",
                  "yay"
               ]
            }
         ]
      }
   ]
}

How would I achieve that?

Part 3

Here, I want to preserve a portion of the higher level object as well. I also want to leave out some fields, both in the higher level object, and in the matching array elements.

The target is the same, but I want the following output:

{
   "stuff":"is_cool",
   "a_list":[
      {
         "ARBITRARY":"Milk",
         "notes":[
            {
               "label":"chocolate",
               "comments":[
                  "yummy",
                  "yummiest"
               ]
            },
            {
               "label":"strawberry",
               "comments":[
                  "pink stuff",
                  "yay"
               ]
            }
         ]
      }
   ]
}

How would I achieve that?

Part 4

I don't want to hard-code the spec. Rather, I want to load it as configuration. I could wrap the spec in a string and then call eval on it, but I think this is bad practice.

As described in the documentation, I don't want to use code injection, nor do I want to use the command line interface for this. I'd rather load the spec from a configuration file, and still have access to the full power of the spec (so that I can use things like lambda functions, and methods such as Coalesce)


How do I load the glom spec from a configuration file, or from a string?

davesargrad commented 3 years ago

Ah Wow.. As seen in the snippets, this seems to do the trick for the first part of the question: glom([1, 2, 3, 4, 5, 6], [lambda i: i if i % 2 else SKIP])

I simply replace the target with my target, and the integer modulo check (inside the spec) with i['SOURCE TYPE'] == "Something Else".

spec = ('a_list', [lambda i: i if i['SOURCE TYPE'] == "Something Else" else SKIP])

I'm starting to love glom, and I only just met glom today!

I think I am good for Part 1. Could you please help with a spec for Part 2, Part 3, and Part 4?

davesargrad commented 3 years ago

Looks like Parts 2 and 3 are also easy.

Something like this does the job.

{'time': ('time'), 'a_param': ('size'), 'another_param': ('shape'), 'sublist': ('a_list', [lambda i: i if i['SOURCE TYPE'] == 'Something Else' else SKIP])}

So at this point I just need an answer for Part 4.

kurtbrose commented 2 years ago

That's a great question :-) You're knocking on the door of very universal computer science issues.

Can we load an arbitrary spec WITHOUT eval? A spec can embed arbitrary Python objects and functions, so it cannot be represented without "full power" Python.

My practical recommendation would be to have a config.py, transformers.py, or similar file where you store the spec. Then it's up to you, by convention, to keep the code "simple". This is how e.g. gunicorn and django handle configuration, and I've found it to work well.

Could you have a LIMITED spec and load parts of it from JSON or similar? Yes, absolutely. I think it will end up being less readable than using a python-syntax config file, but it could be done.

vineetsingh065 commented 1 year ago

@davesargrad I also encountered this issue. Did you find a solution? My problem is similar to the Part 2 scenario.