snowplow / enrich

Snowplow Enrichment jobs and library
https://snowplowanalytics.com
Other
21 stars 39 forks source link

Add failure entity for incomplete events #886

Closed spenes closed 7 months ago

spenes commented 7 months ago

Failure entities are started to be attached to incomplete events as derived contexts.

There are some cases that we need more information than in the bad rows to create failure entities therefore it isn't possible to create failure entities from bad rows directly. For this purpose, we created wrapper classes to attach extra information about failure entities.

spenes commented 7 months ago

Thanks for the review @benjben!

I see a few problems with the approach you suggested. First, I think it is hard to reconstruct BadRow from FailureEntitys. As you know, there can be multiple FailureEntity for one BadRow. We need some kind of mechanism to map FailureEntitys to their BadRow. Other problem I see with that approach is that we would be starting to redefine BadRows as FailureEntity in this case just to add a few extra bits. I am not sure it is good idea to duplicate BadRow definitions in the Enrich codebase. Of course, other solution might be add those extra bit of information to original BadRowdefinitions but I am not sure we want to go that route at this stage since that would make things more complicated.

I agree wrappers doesn't look pretty but this is kinda necessary workaround for this feature.

SchemaViolationWithExtraContext contains only a few extra bits additional to SchemaViolation to create FailureEntity. This will be exactly what we will do even if define completely new FailureEntity with structure almost same with SchemaViolation itself.

I agree with your comment about BadRowWithFailureEntities but don't we need both BadRow and List[FailureEntity] in the end ? We can postpone their creation with creating intermediate format that can be converted to both but we need to create both eventually.

benjben commented 7 months ago

As you know, there can be multiple FailureEntity for one BadRow. We need some kind of mechanism to map FailureEntitys to their BadRow

There are 3 steps that can produce a bad row:

  1. Validating the input
  2. Enriching
  3. Validating the output

Each of these steps would now return a List[FailureEntity] instead of a BadRow. We know that all the FailureEntitys for a step are for the same bad row, so we just need to pattern match on the first one and we'll know the type of bad row to use. The left part of the IorT would then be NonEmptyList[List[FailureEntity]].

With this approach it is all about the mapping function from FailureEntity to FailureDetails. We get rid of all the wrappers.

Other problem I see with that approach is that we would be starting to redefine BadRows as FailureEntity

No, we'd leave BadRow intact. What do you think that we'd need to change it ?