martijn / xsv

High performance, lightweight .xlsx parser for Ruby that provides nothing a CSV parser wouldn't
https://storck.io/posts/announcing-xsv-1-0-0/
MIT License
194 stars, 20 forks

Enhance parsing by headers #32

Closed (a1tavista closed this 2 years ago)

a1tavista commented 3 years ago

Hello there! Thank you for the gem, you did a really good job!

I'd like to suggest a feature that I can implement as a contributor if we decide to go ahead with it. Here is the idea.

The roo gem (which you surely know) has a very useful feature: you can pass it a hash that maps keys to header patterns. My team uses roo to parse spreadsheets with Russian headers, and then we use the spreadsheet contents to create ActiveRecord (AR) entities, for example.

In roo it looks like:

require "roo"

SET_OF_HEADERS = {
  name: /Название организации|Название/i,
  inn: /ИНН/i,
  kpp: /КПП/i
}.freeze

xlsx = Roo::Excelx.new(filepath)
raw_data = xlsx.sheet(0).parse(SET_OF_HEADERS)

raw_data.first.keys # => [:name, :inn, :kpp]

This lets you define the keys of your data items up front, so there is no need to transform the keys of every hash before passing the data to the next method, for example. With this feature you can also automatically detect the offset of the first significant row of your data, past any blank or extra rows, because the documents we parse sometimes look like this:

image

As you can see, there are two rows that shouldn't be present in the parsed data: they are just instructions for whoever works with this template on how to fill in the rows.
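To make the idea concrete, here is a minimal plain-Ruby sketch of header matching with offset detection. It does not use roo or xsv. The helper `parse_with_headers`, the sample rows, and the assumption that the instruction rows sit above the header row are all hypothetical and only illustrate the behavior described above.

```ruby
# Hypothetical helper, not part of roo or xsv.
# rows:     array of arrays (one inner array per spreadsheet row)
# patterns: hash of { key => Regexp } describing the expected headers
def parse_with_headers(rows, patterns)
  # The header row is the first row in which every pattern matches some cell.
  header_index = rows.index do |row|
    patterns.values.all? { |pattern| row.any? { |cell| cell.to_s.match?(pattern) } }
  end
  return [] if header_index.nil?

  # Remember which column each key was found in.
  header_row = rows[header_index]
  columns = patterns.transform_values do |pattern|
    header_row.index { |cell| cell.to_s.match?(pattern) }
  end

  # Everything below the header row is data; rows above it are skipped.
  rows[(header_index + 1)..].map do |row|
    columns.transform_values { |col| row[col] }
  end
end

HEADER_PATTERNS = {
  name: /Название организации|Название/i,
  inn: /ИНН/i,
  kpp: /КПП/i
}.freeze

# Made-up sample data: two non-data rows, then headers, then one record.
rows = [
  ["Fill in one organization per row", nil, nil],
  [nil, nil, nil],
  ["Название организации", "ИНН", "КПП"],
  ["ООО Ромашка", "7701234567", "770101001"]
]

result = parse_with_headers(rows, HEADER_PATTERNS)
result.first.keys # => [:name, :inn, :kpp]
```

Matching on patterns rather than exact strings keeps the mapping tolerant of small variations in the template's header text.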

So if this is interesting to you, I could contribute some time to implement this feature in xsv too.

Best regards.

martijn commented 3 years ago

Thanks for the feedback and suggestion.

I would gladly merge something like this. It seems to me that a header_translations parameter on the parse_headers! method would be the way to go. Please submit some tests with your code so we can ensure the feature does not break in the future.
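As a rough illustration of what such a translation step might do, here is a plain-Ruby sketch. It is not xsv code: `translate_headers` is a hypothetical helper, and the behavior shown (headers without a matching translation are kept as-is) is an assumption rather than anything decided in this thread.

```ruby
# Hypothetical helper, not part of xsv.
# headers:             array of header cell values as read from the sheet
# header_translations: hash of { key => Regexp or String }
def translate_headers(headers, header_translations)
  headers.map do |header|
    # === matches a Regexp against the string, or compares Strings for equality.
    match = header_translations.find { |_key, pattern| pattern === header.to_s }
    match ? match.first : header # assumption: untranslated headers stay unchanged
  end
end

# Made-up sample data.
headers = ["Название организации", "ИНН", "Comment"]
translations = { name: /Название/i, inn: /ИНН/i }

keys = translate_headers(headers, translations)
row = ["ООО Ромашка", "7701234567", "n/a"]
record = keys.zip(row).to_h
```

Here `keys` ends up as `[:name, :inn, "Comment"]`, so each row can be turned into a hash whose keys are ready to pass on to other code.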

a1tavista commented 3 years ago

Thank you for your answer. I'm going to implement this functionality in the near future 🙂

martijn commented 3 years ago

Looking forward to it!