martijn / xsv

High performance, lightweight .xlsx parser for Ruby that provides nothing a CSV parser wouldn't
https://storck.io/posts/announcing-xsv-1-0-0/
MIT License
194 stars 20 forks source link

:symbolize_names option for hash mode #26

Closed senhalil closed 2 years ago

senhalil commented 3 years ago

It would fit well with other hash returning functions like JSON.parse and would be of use.

I am not sure but I suspect there would be less memory allocation since currently, each row allocates a new string key for the header labels in the :hash mode.

Similar to https://github.com/martijn/xsv/issues/24 this could be done in two ways (or both? :thinking:) with a symbolize_names: true argument i) to open function and/or ii) to parse_headers! function.

senhalil commented 3 years ago

As a demonstration, this commit https://github.com/senhalil/xsv/commit/d2c10c1dea6a227b74e6c23b50c7b7f42b987b87 makes symbols default without an argument.

I suspect that making symbols default wouldn't be the right thing since symbols are never released and this might be an issue for long running processes that parse xlsx documents. But for short running scripts/workers symbols would decrease memory allocation and speed up comparison.

@martijn I feel like it shouldn't be me who adds this argument (since I don't have the time to consider up/downs, nor the visibility of the user-base) but I thought the commit might simplify the life of the person who will actually implement the argument.

texpert commented 3 years ago

As a demonstration, this commit senhalil@d2c10c1 makes symbols default without an argument.

I suspect that making symbols default wouldn't be the right thing since symbols are never released and this might be an issue for long running processes that parse xlsx documents. But for short running scripts/workers symbols would decrease memory allocation and speed up comparison.

JFI, symbols are garbage collected since Ruby 2.2 - https://bugs.ruby-lang.org/issues/9634

senhalil commented 3 years ago

I should have read the comments of the answer that I shared :facepalm: (I will edit the answer with your link cheers @texpert ) So then there is no downside... Let me know if you want the commit and I will create a PR.

martijn commented 3 years ago

I can see why symbolized keys would be useful or nicer in some Ruby code so I am fine with merging a proposal to make this an option (maybe as a parameter to symbolize_keys!?).

There is no gain in allocations, though. Observe:

irb(main):001:0> x = Xsv::Workbook.open("test/files/10k-sheet.xlsx")
=> #<Xsv::Workbook:960>
irb(main):002:1* sheet = x.sheets[0]
=> #<Xsv::Sheet:980>
irb(main):004:0> sheet.parse_headers!
=> #<Xsv::Sheet:980>
irb(main):005:0> row0 = sheet[0]
=> {"A1"=>"A2", "This sheet has 10.000 rows"=>"This sheet has 10.000 rows", 1=>2, #<Date: 1992-10-04 ((2448900j,0s,0n),+0s,2299...
irb(main):006:0> row1 = sheet[1]
=> {"A1"=>"A3", "This sheet has 10.000 rows"=>"This sheet has 10.000 rows", 1=>3, #<Date: 1992-10-04 ((2448900j,0s,0n),+0s,2299...
irb(main):007:0> row0.keys.first.object_id
=> 1020
irb(main):008:0> row1.keys.first.object_id
=> 1020

For every row, the key objects are shared since we dup the @empty_row for each new row.

senhalil commented 3 years ago

Thanks @martijn, you are right! I tried with memory_profiler and symbol and string both allocates and retains the same amount of memory and type of objects.

Total allocated: 252195568 bytes (860308 objects)
Total retained:  124963 bytes (1967 objects)

allocated memory by gem
-----------------------------------
 205028515  xsv-1.0.1
  47166973  rubyzip-2.3.0
        80  other