mglaman / drupal-typed-data-by-example

Drupal's Typed Data API by example
GNU General Public License v3.0

Explain how Typed Data can satisfy a Data Sync story #17

Open cosmicdreams opened 2 years ago

cosmicdreams commented 2 years ago

Stop me if you've heard this before.

I have products in my Drupal site, and products in another (maybe not Drupal) site. I want to run a process and sync the product data. Over time, I've learned that you can't just fire and forget that process. There is a strong need for a series of reports that show data that isn't properly synced, and for vigorous reporting on any sync issues so we can fix whatever is keeping the data from syncing well. So that's the use case.

When creating data sync reports it's important to answer the following questions:

and maybe I could optimize performance by doing these checks during the process:

There are many programmatic approaches to solving the above questions. It would be nice to see a solution that used Typed Data. I am imagining a solution that takes data from each datasource and converts it into a common data type so that:

If you've got something like this already thought through, I'd say run with that. If not, I'm eager to help write some documentation on how to do this...as soon as I figure it out.

mglaman commented 2 years ago

I'll spend some time to give a proper reply, but at a quick skim, here's a thought/question.

Your concern is about sourcing data and ensuring its consistency.

The Typed Data API would be used at the granular level of each object being processed, not at the level at which it is sourced and then stored.

Like I said, I'll read it again for a more in-depth reply.

cosmicdreams commented 2 years ago

I think I understand what you're saying, and I think that's fine.

When considering data-warehouse-based data analysis and reporting best practices, there is value in merely "transporting" data from a remote datasource to a local datasource. If you are debugging a full ETTL (Extract, Transport, Transform, Load) process, it's important to know where data quality issues occur. If the problem is with the original data, then you can fix it at the remote end (for example, how the data is being captured). But if the problem is really that you are transforming the data improperly, then you need to fix it on your end.

All that said, I think Typed Data can help by converting the remote and the local data into a common intermediate state that retains all the available data, plus any calculated or extra fields needed to help make reporting decisions. To fit the specific reports we want, the data would then be transformed again into the associative array we can give to a #table or #tableselect element. For performance, we could cache that intermediate state so that multiple reports can use the same data without needing to redo API calls or expensive Node::load() calls.
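In code, I'm picturing something roughly like this (untested sketch; the row shape, the cache ID, and the report columns are made up for illustration):

```php
<?php

use Drupal\Core\TypedData\DataDefinition;
use Drupal\Core\TypedData\MapDataDefinition;

// In practice these rows come from the remote API and from local entity loads.
$raw_rows = [
  ['sku' => 'ABC-1', 'title' => 'Widget', 'price' => '9.99'],
  ['sku' => 'ABC-2', 'title' => 'Gadget', 'price' => '19.99'],
];

// Common intermediate shape shared by the remote and the local data.
$row_definition = MapDataDefinition::create()
  ->setPropertyDefinition('sku', DataDefinition::create('string')->setRequired(TRUE))
  ->setPropertyDefinition('title', DataDefinition::create('string'))
  ->setPropertyDefinition('price', DataDefinition::create('string'));

$typed_data_manager = \Drupal::typedDataManager();
$rows = array_map(
  fn(array $raw) => $typed_data_manager->create($row_definition, $raw),
  $raw_rows
);

// Cache the plain values of the intermediate state so several reports can
// reuse it without re-running API calls or entity loads.
\Drupal::cache()->set('product_sync.intermediate', array_map(
  fn($row) => $row->toArray(),
  $rows
));

// Transform the intermediate state into the render array for one report.
$build['report'] = [
  '#type' => 'table',
  '#header' => ['SKU', 'Title', 'Price'],
  '#rows' => array_map(
    fn($row) => [
      $row->get('sku')->getValue(),
      $row->get('title')->getValue(),
      $row->get('price')->getValue(),
    ],
    $rows
  ),
];
```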

cosmicdreams commented 2 years ago

I recently built just such a process for a project; I'm just now starting to think about how to modify it to include Typed Data. With everything else I have going on, I think I might have a week early in the year to explore, then I'll have to move on to other tasks.

mglaman commented 2 years ago

> If you are debugging a full ETTL (Extract, Transport, Transform, Load) process, it's important to know where data quality issues occur. If the problem is with the original data, then you can fix it at the remote end (for example, how the data is being captured). But if the problem is really that you are transforming the data improperly, then you need to fix it on your end.

:) This is something I'm working on but it's not public, yet.

You should make the Transform part of your integration with the Typed Data API. Create a data definition that maps your source expectations and your destination expectations. That way you can validate the source value against your data definition and see whether it's what you expect. Ditto for the destination.
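As a rough, untested sketch (the title field and the source/transformed row variables are placeholders, not from a real project):

```php
<?php

use Drupal\Core\TypedData\DataDefinition;

// Placeholder rows standing in for a real source record and its transformed
// counterpart.
$source_row = ['title' => 'Widget'];
$transformed_row = ['title' => NULL];

// One definition encodes what a product title should look like on both ends.
$title_definition = DataDefinition::create('string')
  ->setRequired(TRUE)
  ->addConstraint('Length', ['max' => 255]);

$typed_data_manager = \Drupal::typedDataManager();

// Validate the raw source value against the definition.
$source_violations = $typed_data_manager
  ->create($title_definition, $source_row['title'])
  ->validate();

// Validate the transformed value about to be written to the destination.
$destination_violations = $typed_data_manager
  ->create($title_definition, $transformed_row['title'])
  ->validate();

if (count($source_violations) > 0) {
  // The problem exists upstream: fix it at the remote, not in the transform.
}
elseif (count($destination_violations) > 0) {
  // The source was fine, so the transform itself introduced the problem.
}
```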

That's one problem with Migrate right now: validation and error handling.

mglaman commented 2 years ago

Source your data however you need. Convert it from JSON, CSV, YAML, etc. to an array. Then write a schema for what the object looks like in each "row". Perform validations on each object/row. Then you can find errors more easily, and maybe skip processing that one and continue forward in your processing pipeline.
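Something along these lines, as an untested sketch (the row fields are invented):

```php
<?php

use Drupal\Core\TypedData\DataDefinition;
use Drupal\Core\TypedData\MapDataDefinition;

// Sample payload; in practice this comes from your JSON/CSV/YAML source.
$json = '[{"sku":"ABC-1","title":"Widget","price":"9.99"},{"sku":"ABC-2","title":null,"price":"x"}]';
$raw_rows = json_decode($json, TRUE);

// Schema for what one "row" should look like.
$row_definition = MapDataDefinition::create()
  ->setPropertyDefinition('sku', DataDefinition::create('string')->setRequired(TRUE))
  ->setPropertyDefinition('title', DataDefinition::create('string')->setRequired(TRUE))
  ->setPropertyDefinition('price', DataDefinition::create('string'));

$typed_data_manager = \Drupal::typedDataManager();
$valid_rows = [];
$errors = [];

foreach ($raw_rows as $delta => $raw_row) {
  $typed_row = $typed_data_manager->create($row_definition, $raw_row);
  $violations = $typed_row->validate();
  if (count($violations) > 0) {
    // Record what went wrong and skip this row so the pipeline keeps moving.
    foreach ($violations as $violation) {
      $errors[$delta][] = $violation->getPropertyPath() . ': ' . (string) $violation->getMessage();
    }
    continue;
  }
  $valid_rows[] = $typed_row;
}
// $errors feeds the data quality report; $valid_rows continue to the load step.
```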

cosmicdreams commented 2 years ago

OK, it sounds like my thinking is going down the right path then. I wish I could, like, upload a diagram of the flow to help explain what I've made so far.

Next Steps:

Expected Goals:

cosmicdreams commented 2 years ago

Getting back to the original feature request: I think I can help by building some code to explain the use case.

Key bits along the way would include:

  1. Pull product data into a Typed Data-enhanced object (I suggest we call it Datarow and not Dimension, otherwise we might confuse the data warehouse nerds).
  2. Show code that pulls data from a remote endpoint into the same object.
  3. Show examples of comparison logic, and the helpful book-keeping properties we would want to include (source_name, destination_report, and others); there's a rough sketch after this list.
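For items 1 and 3, here's a rough, untested sketch of what I mean; the function name and the bookkeeping property names are placeholders:

```php
<?php

use Drupal\Core\TypedData\DataDefinition;
use Drupal\Core\TypedData\MapDataDefinition;

/**
 * Builds the hypothetical "Datarow" definition: the product payload plus the
 * bookkeeping properties used for sync reporting.
 */
function product_sync_datarow_definition(): MapDataDefinition {
  $payload = MapDataDefinition::create()
    ->setPropertyDefinition('sku', DataDefinition::create('string')->setRequired(TRUE))
    ->setPropertyDefinition('title', DataDefinition::create('string'))
    ->setPropertyDefinition('price', DataDefinition::create('string'));

  return MapDataDefinition::create()
    // Which system this row came from ('local', 'remote', ...).
    ->setPropertyDefinition('source_name', DataDefinition::create('string')->setRequired(TRUE))
    // Which report this row should surface in.
    ->setPropertyDefinition('destination_report', DataDefinition::create('string'))
    // When the sync process last saw this row (UNIX timestamp).
    ->setPropertyDefinition('checked', DataDefinition::create('timestamp'))
    // The product data itself.
    ->setPropertyDefinition('payload', $payload);
}

// Wrapping a local and a remote product in the same shape makes them
// directly comparable.
$typed_data_manager = \Drupal::typedDataManager();
$local = $typed_data_manager->create(product_sync_datarow_definition(), [
  'source_name' => 'local',
  'payload' => ['sku' => 'ABC-1', 'title' => 'Widget', 'price' => '9.99'],
]);
$remote = $typed_data_manager->create(product_sync_datarow_definition(), [
  'source_name' => 'remote',
  'payload' => ['sku' => 'ABC-1', 'title' => 'Widget', 'price' => '10.99'],
]);
// FALSE here: the prices differ, so this pair lands on the "not in sync" report.
$in_sync = $local->get('payload')->toArray() == $remote->get('payload')->toArray();
```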

Being able to show a use case such as this may be helpful to the migrate developers. It might spark an idea for them to include support for custom reporting on migration runs.

cosmicdreams commented 2 years ago

Do you have a "Map" example sounds like that needs to be handled with special attention. Sounds like it's close the array situation I'm looking for.