kgiszczak / shale

Shale is a Ruby object mapper and serializer for JSON, YAML, TOML, CSV and XML. It allows you to parse JSON, YAML, TOML, CSV and XML data and convert it into Ruby data structures, as well as serialize data structures into JSON, YAML, TOML, CSV or XML.
https://shalerb.org/
MIT License
618 stars 19 forks source link

Nested data in CSV #15

Closed tbsvttr closed 1 year ago

tbsvttr commented 1 year ago

Hi! It's me again!

As already mentioned in the other issue, I am very interested in the CSV serialization capabilities of this (cool!) library.

However, it seems that the CSV serialization is not working in a very useful manner when data is nested.

As an example, given this setup:

  address = Address.new(city: "San Francisco")
  person = Person.new(first_name: "John", last_name: "Doe", address: Address.new(city: "San Francisco"))
  person_presenter = PersonPresenter.new(person)

These are the results:

person_presenter.to_csv == <<~CSV.chomp
    John,Doe,,false,[],"{""city""=>""San Francisco"", ""street""=>nil, ""zip""=>nil}"
CSV

person_presenter.to_json == <<~JSON.chomp
    {\"first_name\":\"John\",\"last_name\":\"Doe\",\"married\":false,\"hobbies\":[],\"address\":{\"city\":\"San Francisco\"}}
JSON

person_presenter.to_xml == <<~XML.chomp
    <person>
        <first_name>John</first_name>
        <last_name>Doe</last_name>
        <married>false</married>
        <address>
            <city>San Francisco</city>
        </address>
   </person>
XML

Note how the JSON and XML can sensible handle the nested data structure, but the CSV now actually contains a JSON string in a field. My guess is that this will not be acceptable for most scenarios where CSV is actually used. Also note that in this JSON string, the nil values are rendered, while they are not rendered anywhere else. No render_nil was used anywhere. This lets me guess that this is somewhat of an unhandled or unintended behaviour.

My suggestion would be to flatten the structure so that it will be compatible with CSV. Something like this:

Or withrender_nil: true:

<<~CSV.chomp
    John,Doe,false,[],San Francisco,,
CSV

Note that it render an empty string into each field, so that the table which is represented in CSV does maintain is coherence for each row.

With the following headers:

<<~CSV.chomp
    first_name,last_name,married,hobbies,address.city,address.street,address.zip
CSV

What do you think? Or did I overlook some feature of the library?

kgiszczak commented 1 year ago

well, that's because CSV is a flat format, it doesn't support "nesting", you can work around it by using methods to extract/generate data, that way you have full control. Here's an example:

class Address < Shale::Mapper
  attribute :street, Shale::Type::String
  attribute :city, Shale::Type::String
end

class Person < Shale::Mapper
  attribute :name, Shale::Type::String
  attribute :address, Address, default: -> { Address.new }

  csv do
    map 'name', to: :name
    map 'street', using: { from: :street_from_csv, to: :street_to_csv }
    map 'city', using: { from: :city_from_csv, to: :city_to_csv }
  end

  def street_from_csv(model, value)
    model.address.street = value
  end

  def street_to_csv(model, doc)
    doc['street'] = model.address.street
  end

  def city_from_csv(model, value)
    model.address.city = value
  end

  def city_to_csv(model, doc)
    doc['city'] = model.address.city
  end
end

person = Person.from_csv(<<~DATA)
John Doe,Oxford Street,London
DATA

# => #<Person:0x2a06a
#    @name="John Doe"
#    @city="London",
#    @street="Oxford Street">

person.to_csv

# => "John Doe,Oxford Street,London"
tbsvttr commented 1 year ago

Thanks for your quick reply! I just edited my initial post again to improve upon it and showing that there is a bit of a problem with nil values in CSV, too.

well, that's because CSV is a flat format, it doesn't support "nesting",

That is absolutely correct! However, shouldn't default behaviour create a somewhat acceptable and unsurprising result? Just having a JSON string which is created with different rules as the actual to_json result seems kinda wired to me. Why not flatten the structure as I suggested?

you can work around it by using methods to extract/generate data, that way you have full control. Here's an example:

This would actually solve the problem and would be 'good enough'. However, it would create a lot of boilerplate for a lot of people, I guess.

kgiszczak commented 1 year ago

Thanks for your quick reply! I just edited my initial post again to improve upon it and showing that there is a bit of a problem with nil values in CSV, too.

That actually is documented (https://github.com/kgiszczak/shale#rendering-nil-values). CSV renders nil values by default (as opposed to other formats), because without it, it would mangle the CSV output. render_nil: false means don't include the field/column at all in the output (so in the case of CSV it would mean one comma less).

E.g. if your first_name = nil and last_name = 'Doe' you would get:

first_name,last_name
Doe

If you really need, you can overwrite this behaviour with explicitly setting render_nil: false - but in the case of CSV you would need that in a very rare circumstances.

That is absolutely correct! However, shouldn't default behaviour create a somewhat acceptable and unsurprising result? Just having a JSON string which is created with different rules as the actual to_json result seems kinda wired to me.

That actually is not a JSON but stringified Ruby Hash, as all formats except XML uses this data structure internally. Underneath to_csv is calling as_csv (or as_json, as_yaml and so on) method that returns this hash and then converts it to CSV. E.g.

person.as_csv

# => { 'first_name' => 'John', `last_name` => 'Doe' }

Why not flatten the structure as I suggested?

However, it would create a lot of boilerplate for a lot of people, I guess

I totally get it, ang it's not the first time this was raised. I tried to implement a solution for this problem, but it opend a can with worms with so much edge cases that I gave up. I might try to approch this problem again in the future, but there's no a quick solution for this.

tbsvttr commented 1 year ago

I totally get it, ang it's not the first time this was raised. I tried to implement a solution for this problem, but it opend a can with worms with so much edge cases that I gave up. I might try to approch this problem again in the future, but there's no a quick solution for this.

Thanks for your understanding. Can I find your approach back then somewhere? Maybe we can improve upon it, or maybe it is even enough for our use case.

That actually is not a JSON but stringified Ruby Hash, as all formats except XML uses this data structure internally. Underneath to_csv is calling as_csv (or as_json, as_yaml and so on) method that returns this hash and then converts it to CSV.

Ah, I see. So this is some kind of fallback in case of “nested CSV”.

If you really need, you can overwrite this behaviour with explicitly setting render_nil: false - but in the case of CSV you would need that in a very rare circumstances.

You are absolutely right. Not rendering the nil values in a CSV would break the CSV in most cases. I actually suggested this behaviour for the representation of nested data within a CSV. My comment about it showing nil was about that stringified Hash you use internally (which I misunderstood for some wired JSON).

That actually is documented (https://github.com/kgiszczak/shale#rendering-nil-values). CSV renders nil values by default (as opposed to other formats), because without it, it would mangle the CSV output. render_nil: false means don't include the field/column at all in the output (so in the case of CSV it would mean one comma less).

You are right here, too. However, I feel it is a bit inconsistent to have CSV so different behaviour than anything else. Why not make render_nil: true the default behaviour?

kgiszczak commented 1 year ago

Thanks for your understanding. Can I find your approach back then somewhere? Maybe we can improve upon it, or maybe it is even enough for our use case.

I made pretty good progress on this and actually came up with a solution that I like and works really well. It won't be production ready for at least couple of weeks, but I can push a branch with it for you to try out and hear your feedback on it. I'll do it tomorrow.

Ah, I see. So this is some kind of fallback in case of “nested CSV”.

Not really a fallback, more like a side effect of the underlying CSV parser.

You are right here, too. However, I feel it is a bit inconsistent to have CSV so different behaviour than anything else. Why not make render_nil: true the default behaviour?

Because CSV is much different than JSON, YAML, TOML or XML - in these formats you usually don't want to render empty fields/nodes, in CSV you have to, otherwise you mangle the output.

tbsvttr commented 1 year ago

I made pretty good progress on this and actually came up with a solution that I like and works really well. It won't be production ready for at least couple of weeks, but I can push a branch with it for you to try out and hear your feedback on it. I'll do it tomorrow.

Thanks, that would be very good!

tbsvttr commented 1 year ago

@kgiszczak Just a short reminder. :)

kgiszczak commented 1 year ago

sorry for the delay, I had a lot of work past few days, but I'm almost done, I need one more day to finish tests and write documentation, it'll be ready after the weekend.

kgiszczak commented 1 year ago

Hey @tbsvttr the feature is ready https://github.com/kgiszczak/shale#delegating-fields-to-child-attributes

tbsvttr commented 1 year ago

@kgiszczak Hey, that is very cool! As far as I can see, this solves the problem with relatively little boilerplate code.

Can you release this as v0.10?

kgiszczak commented 1 year ago

I will make a release in a few weeks