Layers - Githubissues

davedoesdemos commented 4 years ago

I don't agree with the concept of layers of the data lake. This misleads customers into thinking they should have a structure around this. Instead we should refer to them as data types for the data sets stored in the lake. As far as data sets are concerned there are effectively two types: finished data sets with a contract on what they deliver, and transitional data sets which are used to build finished data sets. In this setup a transitional data set can later be promoted if it is used by more than one downstream data set.
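As a rough illustration of the two-type model described above, here is a minimal Python sketch. The names `DataSet`, `DataSetKind`, and `register_consumer` are hypothetical, invented for this example, and are not part of any Azure or Data Lake API. It also assumes promotion happens automatically once a second downstream consumer appears; in practice promotion would be a deliberate step that includes agreeing the contract.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataSetKind(Enum):
    """The two data set types proposed in this comment."""
    TRANSITIONAL = "transitional"  # intermediate, used to build finished data sets
    FINISHED = "finished"          # has a contract on what it delivers


@dataclass
class DataSet:
    # Hypothetical record describing one data set stored in the lake.
    name: str
    kind: DataSetKind = DataSetKind.TRANSITIONAL
    consumers: set = field(default_factory=set)  # names of downstream data sets


def register_consumer(data_set: DataSet, consumer: str) -> DataSetKind:
    """Record a downstream consumer and promote the data set if it is shared.

    Assumption for this sketch: a transitional data set used by more than
    one downstream data set is promoted to finished.
    """
    data_set.consumers.add(consumer)
    if data_set.kind is DataSetKind.TRANSITIONAL and len(data_set.consumers) > 1:
        data_set.kind = DataSetKind.FINISHED
    return data_set.kind


# Usage: one consumer keeps the data set transitional, a second promotes it.
cleansed = DataSet("cleansed_orders")
register_consumer(cleansed, "sales_report")
register_consumer(cleansed, "demand_forecast")
print(cleansed.kind.value)
```

Note that nothing in this model refers to a position in the lake's folder or container structure; the type is a property of the data set itself.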

trinadhkotturu commented 4 years ago

Are there any specific customer references where this was an issue? Customers are somewhat familiar with the layered structure, and it helps them visualize the move from transitional data set to finished data set. Reducing it to two stages loses the fact that the transition itself can have multiple stages.

davedoesdemos commented 4 years ago

Customers are only familiar because they read the docs. As soon as they try to implement layers in the structure, things start to go badly wrong. Curated is a step in data processing and has no business being a part of the structure of the lake from my perspective. If you have a justification other than that the existing docs talk about it then I'd be happy to hear it, but I've never seen or heard a reason you'd have curated or enriched as a specific structural element at the lake level. They are steps in creating a single data set, but not at the lake structure level. We used to call this sort of thing temporary files ;) Where a curated/enriched data set is used by more than one thing, it becomes its own data set and should be managed as such.

trinadhkotturu commented 4 years ago

I am surprised, because when I joined I struggled to find docs corresponding to the decks I had seen in this area. Pretty much every deck I saw on Data Lake mentioned these layers, and the docs didn't.

I agree with you on the curated part, but not on the cleansed data part. That is still at lake level; otherwise I am relying on the source system to push cleansed data, which I am not comfortable with. So at least one additional layer comes in between the raw data and the finished data set.

rukmanigopalan commented 4 years ago

I had a very similar observation to @trinadhkotturu: the zones have been a pretty well-known pattern for organizing the data. I would recommend thinking of it less as imposing a structure and more as a framework for organizing the data as it gets transformed with analytics engines, and for guiding other decisions such as access management. @davedoesdemos - can you please help shed light on the specific confusion that comes out of this structure? I'm resolving this issue as by design. Let us please take this conversation offline to ensure we reach a reasonable middle ground.

davedoesdemos commented 4 years ago

@trinadhkotturu I agree you do need the data cleansed, but this has no relationship with the structure of the data lake. It is based on the data set itself, which is a completely separate topic to data lake structure. @rukmani-msft I agree, hence my suggestion that it shouldn't be a part of the data lake structure documentation. This is very clearly part of data transformation, and it muddies the water here for no reason. Many of our customers start off trying to make a container for each of the "layers", which is terrible from a design perspective. Perhaps this is a UK thing, but docs talking interchangeably about lake structure and data transformation are not helpful in this regard.

rukmanigopalan / adlsguidancedoc

Layers #5