shuijian-xu / hive

Integration #66

Open shuijian-xu opened 5 years ago

shuijian-xu commented 5 years ago

Some of the issues relating to integration were described in the definition of data warehouses. There are, in fact, two main aspects of integration:

  1. Format integration

  2. Semantic integration

shuijian-xu commented 5 years ago

Format Integration

Format integration is concerned mainly with ensuring that domain integrity is restored where it has been lost. In most organizations, there are many cases where attributes in one system have different formats from the same, or similar, attributes in other systems. For example:

  1. Bank account numbers or telephone numbers might be stored as type “String” in one system and type “Numeric” in others.

  2. Sex might be stored as “male”/“female”, “m”/“f”, “M”/“F”, or even 1/0.

  3. Dates, as previously described, can be held in many formats, including “ddmmyy,” “ddmmyyyy,” “yymmdd,” “yyyymmdd,” “dd mon yy,” and “dd mon yyyy.” These are just a few examples. Some systems store dates as a time stamp that is accurate to thousandths of a second. Others use an integer that is the number of seconds from a particular time, for example, 1 Jan 1900.

  4. Monetary attributes are also a problem. Some systems store money as integer values and expect the application to insert the decimal points. Others have embedded decimal places.

  5. Different systems use differing sizes for string values such as names, addresses, and product descriptions.

The integration procedure consists of a series of rules designed to ensure that the data loaded into a data warehouse is standardized, so that all dates have the same format, monetary values are always represented the same way, and so on.
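As a rough illustration of what such rules might look like, the following Python sketch standardizes three of the attribute kinds listed above. The function names, the canonical codes "M"/"F", and the reading of 1 as male and 0 as female are assumptions made for this example; the text does not prescribe them, and a real source system would have to be checked for its own conventions.

```python
from datetime import date, datetime
from decimal import Decimal

# Assumed canonical codes. The meaning of 1 and 0 is source-specific;
# here 1 is taken to mean male and 0 female purely for illustration.
SEX_CODES = {"male": "M", "m": "M", "1": "M",
             "female": "F", "f": "F", "0": "F"}


def standardize_sex(raw):
    """Map the spellings a source might use onto a single code."""
    return SEX_CODES[str(raw).strip().lower()]


def standardize_date(raw, source_format):
    """Parse a date string using the format its source system is known to use."""
    return datetime.strptime(raw, source_format).date()


def standardize_money(raw, implied_decimals=0):
    """Return an exact Decimal amount.

    implied_decimals covers systems that store money as an integer and
    leave it to the application to insert the decimal point.
    """
    return Decimal(str(raw)).scaleb(-implied_decimals)
```

For instance, `standardize_date("310199", "%d%m%y")` and `standardize_date("19990131", "%Y%m%d")` both yield the same `date` value, and `standardize_money(12345, implied_decimals=2)` yields the same amount as `standardize_money("123.45")`. `Decimal` is used rather than `float` so that monetary values are not subject to binary rounding.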

Why is this important?

Imagine you are attempting to use the data warehouse and you want the average salary of all employees grouped by sex and age, where none of those attributes has been properly standardized.

It would be a difficult enough task for an experienced programmer to undertake. It would be next to impossible for a non-IT person.

Also, a data warehouse accepts information from a variety of sources into a single table. This is feasible only if the data is of a consistent type and format.

The integration rule set is used as a kind of map that specifies how information that has been extracted from the source systems has to be converted before it is allowed into the data warehouse.
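One way to picture the rule set as a map is a lookup keyed by source system and source field, where each entry names the warehouse attribute and the conversion to apply. The sketch below is only an assumption about how this could be arranged; the source systems ("payroll", "hr") and all field names are hypothetical.

```python
from decimal import Decimal

# Hypothetical rule set: source system -> source field -> (target, converter).
RULES = {
    "payroll": {
        "emp_name": ("name", str.strip),
        # This assumed system stores salary as an integer number of cents.
        "sal_cents": ("salary", lambda v: Decimal(v).scaleb(-2)),
    },
    "hr": {
        "FullName": ("name", str.strip),
        "Salary": ("salary", Decimal),
    },
}


def integrate(source, record):
    """Convert one extracted record into the warehouse layout.

    A record with a field that has no rule is rejected rather than
    loaded, so unconverted data never reaches the warehouse.
    """
    rules = RULES[source]
    unknown = set(record) - set(rules)
    if unknown:
        raise ValueError(f"no integration rule for {sorted(unknown)} from {source}")
    return {target: convert(record[field])
            for field, (target, convert) in rules.items()}
```

With these rules, `integrate("payroll", {"emp_name": " Ada ", "sal_cents": 250000})` and `integrate("hr", {"FullName": "Ada", "Salary": "2500.00"})` produce equal records, which is what allows both sources to be loaded into a single table.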

shuijian-xu commented 5 years ago

Semantic Integration

As you know, semantics concerns the meaning of data. Data warehouses draw their data from many different operational systems. Typically, different operational systems are used by different people in an organization. This is logical since financial systems are most likely to be used by the accounts department, whereas stock control systems will be used by warehouse (real warehouse, not data warehouse) staff.

Think back to the earlier discussion about what a sale is. This kind of ambiguity exists in all organizations and can be very confusing to a database analyst trying to understand the kinds of information being held in systems. The problem is compounded because, often, the users of the systems and the information are unaware of the problem. In everyday conversations, the fact that they are unknowingly discussing different things may not be obvious to them, and most of the time there are no serious repercussions.

In building a data warehouse, we do not have the luxury of being able to ignore these usually subtle differences in semantics because the information produced by the queries that exercise the data warehouse will be used to support decision making at the highest levels in the organization.

It is vital, therefore, that each item of data that is inserted into the warehouse has a precise meaning that is understood by everyone. To this end, a data warehouse must provide a catalog of information that precisely describes each attribute in the warehouse. The catalog is part of the warehouse and is, in fact, a repository of data that describes other data. This “data about data” is usually referred to as metadata. The subject of metadata is important and is described in more detail later on.
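To make the idea of a catalog concrete, here is a minimal Python sketch of metadata held as data. The class, its fields, and the sample entry are all assumptions made for illustration; the text does not say what a catalog entry should contain beyond a precise description of each attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeMetadata:
    """One catalog entry: data that describes a warehouse attribute."""
    name: str
    data_type: str
    definition: str        # the single agreed meaning of the attribute
    source_systems: tuple  # where the values are drawn from


CATALOG = {}


def register(entry):
    """Add an entry, refusing a second definition for the same attribute."""
    if entry.name in CATALOG:
        raise ValueError(f"{entry.name} is already defined in the catalog")
    CATALOG[entry.name] = entry


def describe(name):
    """Return the agreed definition of an attribute."""
    return CATALOG[name].definition


# Hypothetical entry; the definition shown is an example, not a recommendation.
register(AttributeMetadata(
    name="sale_date",
    data_type="date",
    definition="Date on which the customer was invoiced",
    source_systems=("finance",),
))
```

Refusing duplicate registrations is one simple way to keep a single agreed meaning per attribute, which is the point of semantic integration.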