johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io
Other
8.98k stars 216 forks source link

cut and regex: how to apply it to a flattened JSON? #737

Closed aborruso closed 2 years ago

aborruso commented 3 years ago

Hi, I have this input

{"a":3,"b":["a","c"]}

If I run mlr --j2c cat input.json I have

a,b.1,b.2
3,a,c

I want to cut, all the fields that contain .. Then I change my command in:

mlr --j2c cut -r -f "\." input.json

#or

mlr --j2c cut -r -f "[.]" input.json

but I have no result. How to apply regex to the output symbol of a flattened JSON input (the .)?

I'm using Miller 6.

Thank you

johnkerl commented 3 years ago

@aborruso this is one of those things which I fear will confuse people about Miller 6 ... yet I do (I think ... more below) want this behavior. Namely, by default, the field name internally to Miller is b (not b.1 and b.2) -- because b is an array-valued field. You can see this with

$ mlr -j cat input.json
{
  "a": 3,
  "b": ["a", "c"]
}

The flattening to b.1 and b.2 happens at the very end of verb-processing when the fields are "flattened" down since CSV is the output format, and CSV doesn't support nested values.

More info is on flatten/unflatten at https://miller.readthedocs.io/en/latest/flatten-unflatten

If you want to remove field names with . in them you would need to flatten before the cut, so

$ mlr --j2c flatten then cut -rf '\.' input.json
b.1,b.2
a,c

$ mlr --j2c flatten then cut -rxf '\.' input.json
a
3

The reason I think I want this behavior by default is I think it's the most natural way to go back and forth between JSON & CSV.

But I may be wrong.

Auto-flatten/auto-unflatten is (currently) the default behavior. Without that we would have

$ mlr --j2c --no-auto-flatten cat input.json
a,b
3,"[""a"", ""c""]"

which is (to me anyway) less pleasant.

But that's only my own viewpoint & I'm eager to hear other points of view on what default behavior for flatten/unflatten should be ...

aborruso commented 3 years ago

Hi @johnkerl thank you very much. It's very clear and it makes a lot of sense; reading the guide I did not understand it, my fault.

I often get a little crazy trying to flatten file as the one below. I think with Miller 6 it will be even easier than with version 5.

[
  {
    "city": "Palermo",
    "id": 1,
    "country": "Italy",
    "population": 5000000,
    "monuments": [
      {
        "name": "Church of St. Peter",
        "type": "church",
        "address": "Viale della Chiesa, 1"
      },
      {
        "name": "Church of St. Mary",
        "type": "church",
        "address": "Viale del Tramonto, 5"
      }
    ]
  },
  {
    "city": "Milan",
    "id": 2,
    "country": "Italy",
    "population": 6000000,
    "monuments": [
      {
        "name": "Hello world",
        "type": "tower",
        "address": "Via Roma, 24"
      },
      {
        "name": "Church of St. Paul",
        "type": "church",
        "address": "Via da qui, 5"
      }
    ]
  }
]

My stupid goal (stupid because I think it's immediate for you), is to produce using only Miller these two outputs:

mlrgo --j2m flatten then cut -x -r -f "\." 2flatten.json
city id country population
Palermo 1 Italy 5000000
Milan 2 Italy 6000000

and

mlrgo --j2c flatten then cut -r -f "(id|\.)" then reshape -r "\." -o k,v then put -S '$idMonument=sub($k,"^(.+?)\.([0-9]+)\.(.+)$","\2");$k=sub($k,"^.+[0-9]?\.","")' then reshape -s k,v input.json
id idMonument name type address
1 1 Church of St. Peter church Viale della Chiesa, 1
1 2 Church of St. Mary church Viale del Tramonto, 5
2 1 Hello world tower Via Roma, 24
2 2 Church of St. Paul church Via da qui, 5

Every time I get into it, I think there is some easier way to do it using Miller, but I can't find it.

Forgive me for this long off-topic message.

Your answer is clear, I have to use flatten.

johnkerl commented 3 years ago

@aborruso great!

Forgive me for this long off-topic message.

I think it's quite on-topic :) -- if it's confusing/hard for you, it will certainly be so for others.

Your answer is clear, I have to use flatten.

Flattening is certainly an issue, but not the only option (not one you have to use) ... there may be an easier way to get what you want ...

I think what you want to do with cutting out field names that have a . in them is to find array-valued and map-valued fields -- ? So perhaps

mlr --j2m --from input.json put -q '
  for (monument in $monuments) {
    outrec = {
      "id":$id,
      "name": monument.name,
      "type": monument.type,
      "address": monument.address
    };
    emit outrec;
  }
'

or perhaps

mlr --j2m --from input.json put -q '
  for (monument in $monuments) {
    outrec = {"id":$id};
    outrec = mapsum(outrec, monument);
    emit outrec
  }
'
id name type address
1 Church of St. Peter church Viale della Chiesa, 1
1 Church of St. Mary church Viale del Tramonto, 5
2 Hello world tower Via Roma, 24
2 Church of St. Paul church Via da qui, 5

along with

$ mlr --j2m cut -xf monuments 2flatten.json
city id country population
Palermo 1 Italy 5000000
Milan 2 Italy 6000000

(Also I can see how https://github.com/johnkerl/miller/issues/546 would help here!)

Does this help?

johnkerl commented 3 years ago

Even better would be

mlr --j2m --from input.json put -q '
  for (monument in $monuments) {
    emit1 mapsum({"id": $id}, monument);
  }
'

-- this is one of the few remaining blockers for Miller 6, letting the mapexcept(...) (or any other expression evaluating to a map, which isn't already a map) be directly emitted.

aborruso commented 3 years ago

@johnkerl you are always very kind!!

It seems great and you made my day!

Now I must study and try. Thank you

johnkerl commented 3 years ago

@aborruso https://github.com/johnkerl/miller/pull/739

So now this works:

mlr --j2m --from input.json put -q '
  for (monument in $monuments) {
    emit1 mapsum({"id": $id}, monument);
  }
'
aborruso commented 3 years ago

So now this works:

👏👏👏Do you hear this noise? I'm standing here in my room applauding you!!

It's really great. I'm not closing it, because I would like to continue this thread, starting with a more complex input file, and understand if I'm able to write the correct syntax.

Thank you!!!

aborruso commented 3 years ago

Dear John, my real Json file has the below kind of structure.

My first goal could be extract something like

city id monument name monument type monument address
1 Church of St. Peter church Viale della Chiesa, 1
... ... ... ...

This JSON is more complex because monuments is not always an array.

If I run

 mlr --j2m --from input.json put -q '
  for (city in $country.cities) {
    emit1 mapsum({"id": city.id}, city.monuments);
  }
'

I have only one monument, the one for the city in which monuments is not an array:

id name type address
3 Lorem ipsum Building Via Chocchi, 3
{
  "country": {
    "name": "Italy",
    "cities": [
      {
        "name": "Palermo",
        "id": 1,
        "population": 5000000,
        "monuments": [
          {
            "name": "Church of St. Peter",
            "type": "church",
            "address": "Viale della Chiesa, 1"
          },
          {
            "name": "Church of St. Mary",
            "type": "church",
            "address": "Viale del Tramonto, 5",
            "websites": [
              {
                "name": "A great website",
                "type": "Official website",
                "url": "https://www.stmary.it"
              },
              {
                "name": "stmary",
                "type": "Twitter",
                "url": "https://twitter.com/stmary"
              }
            ]
          }
        ]
      },
      {
        "name": "Milan",
        "id": 2,
        "population": 5000000,
        "monuments": [
          {
            "name": "Hello world",
            "type": "tower",
            "address": "Via Roma, 24"
          },
          {
            "name": "Church of St. Paul",
            "type": "church",
            "address": "Via da qui, 5",
            "websites": [
              {
                "name": "A bad website",
                "type": "Official website",
                "url": "https://www.hword.it"
              }
            ]
          }
        ]
      },
      {
        "name": "Napoli",
        "id": 3,
        "population": 3000000,
        "monuments": {
          "name": "Lorem ipsum",
          "type": "Building",
          "address": "Via Chocchi, 3"
        }
      }
    ]
  }
}

If my Json was simpler as this one:

{
  "country": {
    "name": "Italy",
    "cities": [
      {
        "name": "Palermo",
        "id": 1,
        "population": 5000000,
        "monuments": [
          {
            "name": "Church of St. Peter",
            "type": "church",
            "address": "Viale della Chiesa, 1"
          },
          {
            "name": "Church of St. Mary",
            "type": "church",
            "address": "Viale del Tramonto, 5"
          }
        ]
      },
      {
        "name": "Milan",
        "id": 2,
        "population": 5000000,
        "monuments": [
          {
            "name": "Hello world",
            "type": "tower",
            "address": "Via Roma, 24"
          }
        ]
      }
    ]
  }
}

If I run

 mlr --j2m --from input.json  put -q '
  for (city in $country.cities) {
    emit1 mapsum({"id": city.id}, city.monuments);
  }
'

I have an empty result.

It is evident that I have yet to understand the basics, that I have yet to understand how to flow in the structure of the JSON.

In the next few days, I will continue and update you here.

Thank you

aborruso commented 2 years ago

I'm closing it. I must study :)