thoughtbot / art_vandelay

Art Vandelay is an importer/exporter for Rails 6.0 and higher.
MIT License
73 stars 4 forks source link

Let context be merged into import data #27

Open benjaminwil opened 3 months ago

benjaminwil commented 3 months ago

Hi, hope you all are well.

In some cases, the given import data will not be enough to create valid Active Record objects. Developers may need to seed the import data to allievate errors related to required belongs_to associations, and so on. This commit provides an interface for this.

benjaminwil commented 3 months ago

You're right that the existing interface can support relationships by having a CSV column like user_id to make an association. I will try to better explain why I think this would still be valuable given that fact. I will also explain my intended use case, where having a CSV column like user_id is not possible.

First, here is why I think it would be valuable even if having a column with an ID (like user_id) is a possibility:

  1. A record's ID may not be the same across all environments. This means that you would need to prepare multiple CSV files to test imports across development, test, staging, and production environments.
  2. An end user may not have access to the IDs they need to prepare the data, meaning there may be some back-and-forth between the user and the developer. Plus, the end user must trust the developer to provide the remaining import data without any real ability to check and confirm their work before the import.
  3. If the developer made a self-serve imports UI for the end user to import their own data using Art Vandelay, there is no way for the imports to be scoped to the user's account. (Imagine an application with multiple tenants, or users who manage many accounts.) This leads into my intended use case...

I intend to use Art Vandelay as the import/export backend to a self-serve import/export UI. It allows the end user to import and export the contents of their account(s). The end user does not know their own user_id or account_ids. In order to make this import/export UI safe, I need to merge context like current_user and Account.find_by(name: params[:selected_account_name], administrators: [current_user]).

Maybe a self-serve imports/exports UI is beyond the scope what you envision Art Vandelay being used for. I would love to discuss that further. But even if that is not in your vision of what Art Vandelay should be used for, I think points 1. and 2. still make this a valuable feature.

With regard to point 2., I can imagine a developer setting up a script to be reused for multiple accounts in an application with many tenants:

ACCOUNT_NAME=art_vandelay_industries CSV=AVI.csv rake run_import

ACCOUNT_NAME=j_peterman_catalog CSV=JPC.csv rake run_import

I hope this clarifies why I think this feature is valuable. I'm of course happy to discuss it further.

stevepolitodesign commented 2 months ago

Thank you for the additional context!

A record's ID may not be the same across all environments. This means that you would need to prepare multiple CSV files to test imports across development, test, staging, and production environments.

This is a valid point, but I imagine in this case, you could use a different unique identifier, if available. However, I recognize that not all records have a unique identifier other than the id.

An end user may not have access to the IDs they need to prepare the data, meaning there may be some back-and-forth between the user and the developer. Plus, the end user must trust the developer to provide the remaining import data without any real ability to check and confirm their work before the import.

In this case, wouldn't an end user still need to know something unique about the record, though? In the example provided in the tests, we create a record, but I'd imagine in practice we'd need to find a user. Most likely by their email, but an end user may not always have access to that.

If the developer made a self-serve imports UI for the end user to import their own data using Art Vandelay, there is no way for the imports to be scoped to the user's account. (Imagine an application with multiple tenants, or users who manage many accounts.) This leads into my intended use case...

In this case, I think there's an opportunity to parse the CSV before passing it to ArtVandelay::Import. I say this because with your intended use case, you still need to query for a user.

csv_string = CSV.generate do |csv|
  csv << ["email", "password"]
  csv << ["george@vandelay_industries.com", "bosco"]
  csv << ["kel_varnsen@vandelay_industries.com", nil]
end

csv_data = CSV.parse(csv_string, headers: true)

csv_data.each do |row|
  row["user_id"] = current_user.id
end

result = ArtVandelay::Import.new(:users).csv(csv_string)

Admittedly, the API you're introducing is more direct, but I worry that it might be a premature feature.

benjaminwil commented 2 months ago

In this case, I think there's an opportunity to parse the CSV before passing it to ArtVandelay::Import.

I think this makes sense until the CSV has millions of rows. I expect to be importing CSVs with millions of rows.

I am trying to reduce the amount of boilerplate code I have to manage (per new import type) to safely import large amounts of data (in batches).

I don't know exactly how memory management works when you need to parse and transform a large CSV, so maybe I am wrong, but I think this leads to the developer having to manage a lot of the transformations using the CSV interface.

This pull request is actually just a baby step in the direction I ended up going in my own implementation, where context allows you to specify a proc as a value so you can transform the import data in place.

Here is how I am initializing an import in my application with changes to the gem in my fork, for example:

    ArtVandelay::Import
      .new(Bookmark, filtered_attributes: FILTERED_ATTRIBUTES)
      .csv(
        contents,
        attributes: ATTRIBUTES_MAPPING,
        context: {
          account: account,
          private: ->(value) { value == "yes" }, # A boolean attribute.
          read_at: ->(value) { value == "no" ? Time.now : nil }, # A datetime attribute that is a boolean in the CSV.
          tags: ->(value) { find_or_create_tags_from(value) } # A has_many relationship that is a list of comma-separated tags in the CSV.
        }
      )

I hope this demonstrates the benefits that context could provide. If context is a premature or unwanted feature for the importer, I completely understand and will close this pull request.

stevepolitodesign commented 2 months ago

Thank you for sharing that detailed example. I like the interface you've come up with, and can see its value.

For now, though, I think I want to sit on this for a bit before pursing it. My main concern being that I'm not sure if there's a wider need for thus feature (yet).