cosmicdreams opened 2 years ago
I'll spend some time to give a proper reply, but at a quick skim, here's a thought/question.
Your concern is about sourcing data and ensuring its consistency.
The Typed Data API would be used at the granular level of each object being processed. Not the level at which it is sourced and then stored.
Like I said, I'll read again for a more in depth reply.
I think I understand what you're saying, and I think that's fine.
When considering data-warehouse-based analysis and reporting best practices, there is value in merely "transporting" data from a remote datasource to a local datasource. If you are debugging a full ETTL (Extract, Transport, Transform, Load) process, it's important to know where data quality issues occur. If the problem was with the original data, then you can fix it at the remote end (for example, how the data is being captured). But if the problem is really that you are transforming the data improperly, then you need to fix it on your end.
All that said, I think TypedData can help by converting the remote and the local data into a common intermediate state that retains all the available data plus any calculated or extra fields needed to help make reporting decisions. To fit the specific reports we want, the data will then be transformed again into the associative array we can give to a #table or #tableselect element. For performance, we could cache that intermediate state so that multiple reports can use the same data without redoing API calls or expensive Node::load() calls.
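The cached-intermediate-state idea could be sketched like this. This is a language-agnostic illustration in Python, not Drupal's Typed Data API; the source functions, field names, and row shape are all hypothetical stand-ins:

```python
from functools import lru_cache

def fetch_remote():
    # Stand-in for a remote API call; fields here are hypothetical.
    return [{"sku": "A-1", "title": "Widget", "price": "9.99"}]

def fetch_local():
    # Stand-in for an expensive local load (e.g. Node::load() in Drupal).
    return [{"sku": "A-1", "name": "Widget", "amount": 9.99}]

def to_intermediate(row, source):
    # Common shape that retains the data plus extra fields for reporting.
    return {
        "sku": row["sku"],
        "title": row.get("title") or row.get("name"),
        "price": float(row.get("price") or row.get("amount")),
        "source": source,  # calculated field to help reporting decisions
    }

@lru_cache(maxsize=1)
def intermediate_state():
    # Cached once, so multiple reports share one fetch.
    rows = [to_intermediate(r, "remote") for r in fetch_remote()]
    rows += [to_intermediate(r, "local") for r in fetch_local()]
    return tuple(rows)

def price_report():
    # Transformed again into the row shape a #table element would accept.
    return [[r["sku"], r["price"]] for r in intermediate_state()]
```

The point is that `price_report()` and any sibling report read from the same cached `intermediate_state()` rather than re-fetching.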
I recently built just such a process for a project, and I'm just now starting to think about how to modify it to include TypedData. With everything else I have going on, I think I might have a week early in the year to explore; then I'll have to move on to other tasks.
> If you are debugging a full ETTL (Extract, Transport, Transform, Load) process, it's important to know where data quality issues occur. If the problem was with the original data, then you can fix it at the remote end (for example, how the data is being captured). But if the problem is really that you are transforming the data improperly, then you need to fix it on your end.
:) This is something I'm working on but it's not public, yet.
You should have the Transform step be part of your integration with the Typed Data API. Create data definitions that map your source expectations and your destination expectations. That way you can validate the source value against its data definition and see whether it's what you expected. Ditto for the destination.
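A minimal sketch of that "definition on both ends" pattern, in plain Python rather than Typed Data itself (the field names and the dict-based definitions are hypothetical):

```python
# One definition describes what the source should look like, another the
# destination; each value is checked before and after the transform.
SOURCE_DEF = {"sku": str, "price": str}   # remote sends price as a string
DEST_DEF = {"sku": str, "price": float}   # we store it as a float

def violations(row, definition):
    # Return a list of problems, empty if the row matches the definition.
    errors = []
    for field, expected in definition.items():
        if field not in row:
            errors.append(f"{field}: missing")
        elif not isinstance(row[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

def transform(row):
    return {"sku": row["sku"], "price": float(row["price"])}

row = {"sku": "A-1", "price": "9.99"}
assert violations(row, SOURCE_DEF) == []   # source matches expectations
assert violations(transform(row), DEST_DEF) == []   # so does the destination
```

In Typed Data terms, `violations()` plays the role of calling `validate()` on a typed data object built from a data definition.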
One problem with Migrate right now is validation and error handling.
Source your data however you need. Convert it from JSON, CSV, YAML, etc. to an array. Then write a schema for what the object in each "row" looks like, and perform validations on each object/row. That way you can find errors more easily, and maybe skip processing of that one row and continue forward in your processing pipeline.
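The decode-validate-skip pipeline above could look something like this. Again a hedged Python sketch with hypothetical schema and field names, not a specific library's API:

```python
import csv
import io
import json

ROW_SCHEMA = {"sku": str, "qty": int}

def rows_from_json(text):
    # JSON arrives as a list of objects; decode straight to dicts.
    return json.loads(text)

def rows_from_csv(text):
    # CSV arrives as strings, so coerce typed fields while decoding.
    return [{"sku": r["sku"], "qty": int(r["qty"])}
            for r in csv.DictReader(io.StringIO(text))]

def validate(row):
    # Names of fields that are missing or the wrong type.
    return [f for f, t in ROW_SCHEMA.items() if not isinstance(row.get(f), t)]

def process(rows):
    ok, skipped = [], []
    for row in rows:
        errors = validate(row)
        if errors:
            # Record the failure and keep going instead of aborting the run.
            skipped.append((row, errors))
            continue
        ok.append(row)
    return ok, skipped
```

The `skipped` list is what makes the errors easy to find afterward, while the pipeline itself keeps moving forward.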
OK, it sounds like my thinking is going down the right path then. I wish I could, like, upload a diagram of the flow to help explain what I've made so far.
Next Steps:
Expected Goals:
Getting back to the original feature request: I think I can help by building some code to explain the use case.
Key bits along the way would include:
Being able to show a use case such as this may be helpful to the migrate developers. It might spark an idea for them to include support for custom reporting on migration runs.
Do you have a "Map" example? It sounds like that needs to be handled with special attention, and like it's close to the array situation I'm looking for.
Stop me if you've heard this before.
I have products in my Drupal site, and products in another (maybe not Drupal) site. I want to run a process that syncs the product data. Over time, I've learned that you can't just fire and forget that process. There is a strong need for a series of reports that show which data isn't properly synced, and for vigorous reporting on any issues with the sync, so we can fix whatever is leading to the data not syncing well. So that's the use case.
When creating data sync reports it's important to answer the following questions:
and maybe I could optimize performance by doing these checks during the process:
There are many programmatic approaches to solving the above questions. It would be nice to see a solution that uses TypedData. I am imagining a solution that takes data from each datasource and converts it into a common data type so that:
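The convert-to-a-common-type idea could be sketched as a diff between the two datasources. This is a Python illustration with hypothetical field names, not Typed Data itself: key both sides by the same identifier, normalize each row into one shape, then compare:

```python
def normalize(row):
    # Common shape, regardless of which site the row came from.
    return {"sku": row["sku"], "price": round(float(row["price"]), 2)}

def sync_report(remote_rows, local_rows):
    remote = {r["sku"]: normalize(r) for r in remote_rows}
    local = {r["sku"]: normalize(r) for r in local_rows}
    return {
        # Products only one side knows about.
        "missing_locally": sorted(remote.keys() - local.keys()),
        "missing_remotely": sorted(local.keys() - remote.keys()),
        # Products both sides have, but with differing normalized data.
        "mismatched": sorted(k for k in remote.keys() & local.keys()
                             if remote[k] != local[k]),
    }
```

Each bucket of the returned dict would feed one of the sync reports described in the use case above.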
If you've got something like this already thought through, I'd say run with that. If not, I'm eager to help write some documentation on how to do this...as soon as I figure it out.