pcreux / csv-importer

CSV Import for humans on Ruby / Ruby on Rails
MIT License
590 stars 67 forks

A normalization feature? #87

Closed Adamantish closed 3 years ago

Adamantish commented 5 years ago

I couldn't find a gem that enabled me to import a flat CSV into a set of has_many associated models, so I built a lib myself. It worked very nicely for automatically exposing domain misunderstandings between devs and biz people, and ran pretty efficiently.

I was considering whether to finish it off as a gem, but frankly your gem covers many more parsing gotchas, has a nicer DSL and other features, so I was wondering if instead you'd welcome a PR?

You might ask, "why would you even accept a CSV with such horribly denormalised data?". The short answer is that there was little choice. A longer one: when you're trying to establish a data format usable by a wide array of tiny travel agents who don't keep proper databases and, yes, have nothing better than names and locations for their own primary keys, this is what you must do. Asking them to manually type ids and split data across spreadsheet tabs would invite much worse mistakes. I'd guess many other Ruby devs find themselves in similar situations, considering CSV is often the last remaining strategy for integrating non-technical parties.

When the first importer was used, the strict normalizer threw errors exposing facts such as that hotel star ratings can be specific to rooms, not just to whole hotels.

This PR would add a feature that lets you specify a has_many in your model which in turn references another model. Honestly, I wouldn't be able to re-use much of my code because I used a memory-hungry approach: it loads everything into a hash of hashes on a first pass, runs dev-defined column transformations over it (storing error details as it goes), and then does bulk fetches on the target tables before matching everything up. The upside is that re-imports with few changes run very fast indeed. However, I don't see it scaling to tens of thousands of rows.
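In toy form, the first pass looked something like this (all names here are hypothetical illustrations, not code from my lib or from this gem): parse rows into a hash keyed by row index, apply per-column transformations while collecting error details rather than aborting.

```ruby
require "csv"

# Dev-defined transformations, one lambda per column (hypothetical example).
TRANSFORMS = { "Stars" => ->(v) { Integer(v) } }

# Parse the CSV into a hash of row-index => attribute hash, applying
# transformations and recording errors instead of raising.
def first_pass(csv_text)
  rows   = {}
  errors = []
  CSV.parse(csv_text, headers: true).each_with_index do |row, i|
    rows[i] = row.to_h.to_h do |col, raw|
      [col, TRANSFORMS.fetch(col, ->(v) { v }).call(raw)]
    rescue ArgumentError => e
      errors << { row: i, column: col, message: e.message }
      [col, nil]
    end
  end
  [rows, errors]
end

rows, errors = first_pass("Hotel,Stars\nGrand,4\nPlaza,five\n")
# "five" fails the Stars transform, so it's recorded as an error.
```

The bulk fetches against the target tables would then run once per table, using the collected keys, rather than once per row.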

In any case, the problem space is fresh in my head right now and I thought I might as well do something open source with it. What do you think? You've already made the step into the database land, why not a step further?

pcreux commented 5 years ago

Hey @Adamantish,

From what I can see, you're importing a list of rooms + hotels.

I believe that you can make that work by creating a proxy object as described here that takes care of denormalizing one row of Room + Hotel into those two model objects. It would work similarly to a "Form object".
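Something like this, as a rough dependency-free sketch (all names hypothetical; in a real app the `HOTELS`/`ROOMS` stores below would be ActiveRecord models, and you'd want to check the gem's source for exactly which methods it calls on the model):

```ruby
# In-memory stand-ins for the real tables (hypothetical).
HOTELS = {} # hotel name => attributes
ROOMS  = [] # one entry per room

# The proxy: receives one flat CSV row's attributes via plain writers,
# and #save splits them into a hotel record and a room record.
class RoomRow
  attr_accessor :hotel_name, :hotel_stars, :room_name, :room_price

  def save
    HOTELS[hotel_name] ||= { stars: hotel_stars }
    ROOMS << { hotel: hotel_name, name: room_name, price: room_price }
    true
  end
end

# Simulate two CSV rows for the same hotel:
[
  { hotel_name: "Grand", hotel_stars: 4, room_name: "101", room_price: 90 },
  { hotel_name: "Grand", hotel_stars: 4, room_name: "102", room_price: 120 },
].each do |attrs|
  row = RoomRow.new
  attrs.each { |k, v| row.public_send("#{k}=", v) }
  row.save
end
```

After both rows are saved there's one hotel and two rooms: the proxy absorbs the denormalisation, so the importer itself only ever sees one model per row.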

Adamantish commented 5 years ago

Hiya. Yes, and in fact the complexity of room prices meant that in my specific case I extended Room to act as such an object. As advice to gem users, however, it relies on an undocumented promise that these are the only ActiveRecord methods the library uses internally.

In my case there was a last-minute requirement change such that there needed to be Supplier => SourceMarket above Accommodation. Because of the simple generalisation of being able to recursively nest has_many mappings, I was able to make that change with only a few more lines of declarative code and no noticeable slowdown.
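To give a feel for it, the nested mapping was declared in roughly this shape (purely illustrative; this is not the DSL of csv-importer or of any published gem):

```ruby
# Hypothetical recursively nested mapping: each level names a model,
# the column(s) used to identify existing records, and its children.
MAPPING = {
  model: :supplier,
  identifier: :name,
  columns: { name: "Supplier Name" },
  has_many: [
    { model: :source_market,
      identifier: :code,
      columns: { code: "Market Code" },
      has_many: [
        { model: :accommodation,
          identifier: :name,
          columns: { name: "Hotel Name", stars: "Stars" } }
      ] }
  ]
}
```

Adding the Supplier => SourceMarket layer was then just wrapping the existing Accommodation mapping in two more levels of the same structure.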

I was also able to introduce built-in integrity checking with helpful, business-domain-agnostic errors (it's rare that hand-written denormalised data contains no such errors).

In addition, it takes the identifier concept, which in this gem helpfully handles matching up the top-level root records, and extends it to the lower has_many objects, for consistency.

To anybody who says a library shouldn't do too many things, I'd say I like a library like yours that handles parsing, enriched error reporting, object mapping and persistence.

Adamantish commented 3 years ago

Looking back, I see I overwhelmed the description with implementation details and buried the most important point.

Indeed, what you suggested with proxy objects is the simplest way, and my more performance-optimised approach represents an expansion of scope beyond this gem. It bulk-loads from the database the full object graph for the ids addressed in the file, diffs it against the graph in the CSV, then updates, inserts and (optionally) deletes only where necessary. Connections between models are also made efficiently, rather than with a per-row find_by. (This full load is pretty naive in terms of RAM, though. That would need solving if it were to be open sourced.)
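The diff step, in toy form (hypothetical names, ignoring the bulk-load and RAM concerns): bucket every record into inserts, updates or deletes by comparing the two graphs keyed by identifier, instead of issuing a query per row.

```ruby
# existing: identifier => attributes already in the database
# incoming: identifier => attributes parsed from the CSV
# Returns the minimal set of writes needed to reconcile them.
def diff_records(existing, incoming)
  {
    inserts: incoming.reject { |id, _| existing.key?(id) },
    updates: incoming.select { |id, attrs| existing.key?(id) && existing[id] != attrs },
    deletes: existing.keys - incoming.keys
  }
end

existing = { "grand" => { stars: 4 }, "plaza" => { stars: 5 } }
incoming = { "grand" => { stars: 3 }, "ritz"  => { stars: 5 } }

result = diff_records(existing, incoming)
# "ritz" is new, "grand" changed, "plaza" disappeared from the file.
```

An unchanged re-import makes all three buckets (nearly) empty, which is why re-imports with few changes run fast.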

The idea was to make it practical for our managers to have the workflow they want: one big upload at the start of the year while being able to make quick incremental updates here and there.

I appreciate that you'd probably advise using your gem to populate an intermediate table and doing the diffing there. Of course, if you can just map CSV to real production tables and it just works, that's a nice thing to have. Anyway, yes, I'm describing a much more complex library, and what I wrote would only work for strictly DAG-shaped graphs of relations anyway. Such a library would be useful, but it would deserve to be decoupled into several smaller libraries, with a top-level one to glue them together in this useful way.

It's perhaps worth saying that if you're trying to use a CSV file as a fully-fledged admin interface in this almost magical way, something has gone wrong. I had advised that this deserved a real user interface rather than CSV. When I delivered the finished thing, they realised how painful it is to work through the range of validation errors and corner cases, changed their minds, and devoted months to building a nice UI instead.

pcreux commented 3 years ago

Closing this ticket as there haven't been any updates in... years! 😅