python-bonobo / bonobo

Extract Transform Load for Python 3.5+
https://www.bonobo-project.org/
Apache License 2.0

Add support for reading and writing avro files #367

Open juarezr opened 4 years ago

juarezr commented 4 years ago

Add support for reading and writing Avro files using FastAvro.

Avro is faster and safer than other formats such as CSV, JSON, or XML.

As Avro is typed, the field types are detected from the values. Once bonobo starts preserving types, those could be used to determine the field types.

Tested with the workflow MySQL -> SQLAlchemy -> bonobo -> Avro.

Publishing now to gather suggestions.
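
For context, a minimal sketch of how the new writer is meant to plug into a graph. The import path, the constructor signature of AvroWriter, and the dict-shaped rows are assumptions based on this branch and on how CsvWriter works, not a settled API; the extract step stands in for the SQLAlchemy Select node used in the real workflow.

import bonobo

# Assumed import path for the node added in this PR; the final location may differ.
from bonobo.nodes.io.avro import AvroWriter

def extract():
    # Stand-in for the SQLAlchemy Select node used in the MySQL workflow;
    # dict-shaped rows are an assumption about what AvroWriter accepts.
    yield {'station': '011990-99999', 'temp': 22}
    yield {'station': '012650-99999', 'temp': 18}

def get_graph():
    graph = bonobo.Graph()
    graph.add_chain(extract, AvroWriter('weather.avro'))
    return graph

if __name__ == '__main__':
    bonobo.run(get_graph())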

lgtm-com[bot] commented 4 years ago

This pull request introduces 2 alerts when merging 078fcccc973c01a2ab9fb510fb6121e7e8f5af7c into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

lgtm-com[bot] commented 4 years ago

This pull request introduces 1 alert when merging d62465adef183228583becfca5e5d20c2b849362 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

hartym commented 4 years ago

Hello,

This should be an optional dependency; I don't think 90% of users want to install this dep if they are not going to use it.

Thanks a lot for your contribution. What's the status of the code? Should I test it already?

lgtm-com[bot] commented 4 years ago

This pull request introduces 1 alert when merging adc443173c3076408823d3c08bd381ad7e995e96 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

juarezr commented 4 years ago

Regarding status of code:

Regarding optional dependency:

If you want to test it, just pip install fastavro in an environment and replace a CsvWriter with an AvroWriter.

What do you think?

Thanks

hartym commented 4 years ago

Avro columns/fields are typed. But I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo didn't have this feature.

Bonobo uses python's type system, which does not allow implicit conversions.

To solve this, the code has two strategies: a fields property for specifying the field types, and automatic type detection based on the first row/record.

Not sure what "code" you're talking about; this is not far from what bonobo does (see NodeExecutionContext).

It also seems that the tests should do a bit more: you need to "assert" things so pytest will actually check something. I'll give it a try as soon as possible.

Thanks !

juarezr commented 4 years ago

Avro columns/fields are typed. But I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo didn't have this feature.

Bonobo uses python's type system, which does not allow implicit conversions.

Let me explain in more detail what I tried to tackle.

For writing to an Avro file, one should define a schema like this:

schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'day', 'type': 'int', 'logicalType': 'date'},
        {'name': 'time', 'type': 'long', 'logicalType': 'time-micros'},
        {'name': 'temp', 'type': 'int'},
        {'name': 'umidity', 'type': 'bytes', 'logicalType': 'decimal', 'precision': 4, 'scale': 2},
    ],
}
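
For reference, such a schema can be exercised directly with fastavro, independently of bonobo. A minimal round-trip sketch (plain fastavro usage, not code from this PR):

from datetime import date, time
from decimal import Decimal
from fastavro import parse_schema, writer, reader

parsed = parse_schema(schema)

records = [{
    'station': '011990-99999',
    'day': date(2019, 12, 25),    # 'int' + logicalType 'date' accepts datetime.date
    'time': time(21, 22, 23),     # 'long' + 'time-micros' accepts datetime.time
    'temp': 22,
    'umidity': Decimal('57.20'),  # 'bytes' + 'decimal' accepts decimal.Decimal
}]

with open('weather.avro', 'wb') as out:
    writer(out, parsed, records)

with open('weather.avro', 'rb') as source:
    for record in reader(source):
        print(record)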

The workflow that I was trying to handle is the most common in ETL processing:

  1. Querying a database in an RDBMS with a query like select * from mytable
  2. Doing some transformation on the rows of data:
     a. adding, converting, joining and splitting fields from a row
     b. joining and splitting a row in a one-to-many transformation
     c. joining data from another flow/file/query into the row (like descriptions of a key)
  3. Writing the rows/records to the Avro file with the exact or approximated field types

However, at step 3 there was no type information available for creating the required schema.

The problem happens because Python's type system may not be enough to represent the richness of types in an RDBMS database:

Using the current value types given by bonobo, I got the following types working:

But I struggled with the following types:

I tried to solve this in two ways:

  1. Creating a property in AvroWriter for the developer to specify a list of types
  2. Trying to autodetect the types from the first row returned by bonobo. Here I suffered from reduced type information.

I'm looking for the best way to handle this issue.
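
To make the "reduced type information" point concrete, this is roughly the kind of mapping the auto-detection can build from the Python types of the first row alone (a sketch, not the exact table in this branch); note how several distinct SQL types collapse onto the same Avro type:

from datetime import date, time, datetime
from decimal import Decimal

# What a first-row inspection can infer from the Python types alone.
PYTHON_TO_AVRO = {
    bool: 'boolean',
    int: 'long',        # TINYINT..BIGINT all arrive as int
    float: 'double',    # FLOAT and DOUBLE both arrive as float
    str: 'string',      # CHAR, VARCHAR and TEXT all arrive as str
    bytes: 'bytes',     # BIT arrives as bytes
    date: {'type': 'int', 'logicalType': 'date'},
    time: {'type': 'long', 'logicalType': 'time-micros'},
    datetime: {'type': 'long', 'logicalType': 'timestamp-micros'},
    Decimal: 'double',  # precision and scale are unknown here, hence the loss
}

def guess_fields(first_row):
    """Build Avro field definitions from the values of the first row only."""
    return [
        {'name': name, 'type': PYTHON_TO_AVRO[type(value)]}
        for name, value in first_row.items()
    ]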

juarezr commented 4 years ago

It seems also that the tests should do a bit more, you need to "assert" things so pytest will actually check something.

I agree with that. I would just like to settle the design before committing time to writing more tests.

Thanks

juarezr commented 4 years ago

To reproduce the issue I did the following:

  1. created a database at remotemysql.com
  2. executed the script below to create a table with some records
  3. created a graph in bonobo:
     a. connected bonobo to the remote database with SQLAlchemy
     b. executed the query select * from testing in a bonobo graph node
     c. used AvroWriter to write to an Avro file
  4. used the debugger to inspect the resulting types (a quick inspection snippet follows the SQL script below)

Notice that most types mapped to int, float, str, bytes, and datetime.datetime.


create table testing (
    fboolean    BOOLEAN,
    fchar       CHAR(10),
    fvarchar    VARCHAR(10),
    ftext       TEXT(10),
    ftinyint    TINYINT,
    fsmallint   SMALLINT,
    fmediumint  MEDIUMINT,
    fint        INT,
    fbigint     BIGINT,
    fdecimal    DECIMAL(9,2),
    ffloat      FLOAT,
    fdouble     DOUBLE,
    fbit        BIT,
    fdate       DATE,
    ftime       TIME,
    fdatetime   DATETIME,
    ftimestamp  TIMESTAMP,
    fyear       YEAR
);

insert into testing values(
    TRUE, 
    'abcdef', 'ghijkl', 'mnopqr', 
    1, 123, 32000, 66000, 1234567890,
    123.456, 456.789, 123.789, 1, 
    '2019-12-25', '21:22:23', '2019-12-25 21:22:23', '2019-10-25 17:22:23', 
    2019
);

insert into testing values(
    false, 
    'vidi', 'vini', 'vinci', 
    2, 121, 32023, 66066, 9876543210,
    234.567, 567.890, 234.890, 0, 
    '2019-12-15', '15:22:23', '2019-12-15 16:22:23', '2019-10-15 17:15:23', 
    2018
);
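
For step 4, a quick way to inspect the Python-side types without a debugger, assuming a SQLAlchemy engine pointed at that database (the connection URL and driver are placeholders):

from sqlalchemy import create_engine, text

# Placeholder URL: any MySQL-compatible server holding the table above will do.
engine = create_engine('mysql+pymysql://user:password@remotemysql.com/dbname')

with engine.connect() as connection:
    result = connection.execute(text('select * from testing'))
    columns = list(result.keys())
    for row in result:
        for column, value in zip(columns, row):
            print(column, type(value).__name__, repr(value))
        break  # the first row is enough to see the Python types
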
hartym commented 4 years ago

I understand your point.

It should be possible to use types other than builtins; for example, one could use decimals (https://docs.python.org/3/library/decimal.html) to avoid wrong numbers on paychecks, or numpy types to have right-sized variables. There are two ways to do so, and I think (but I may be wrong) it's not bonobo's job to handle this (or at least, not more than providing a way to let the user do it).

Either your data producer already knows how to produce those types as output (a db driver that would yield numpy integers, for example); in that case, the job's already done, and bonobo will just pass those values through. Or your data producer produces other types (assuming they do not contain unwanted approximations) and you can have a node in charge of casting things. This is of course less efficient, but may still work in certain situations, as it will reduce memory waste for further processing and there should be a limited number of rows waiting to be converted. This is already something you can do in a node.
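
A minimal sketch of that "casting node" workaround, assuming the rows flow through the graph as plain dicts (the column names and the Decimal/numpy targets are purely illustrative):

from decimal import Decimal
import numpy as np

# Per-column casts the ETL developer knows to be safe for this dataset.
CASTS = {
    'fdecimal': lambda v: Decimal(str(v)),  # exact representation for money-like values
    'ftinyint': np.int8,                    # right-sized integer instead of a plain Python int
    'fsmallint': np.int16,
}

def cast_columns(row):
    """Plain bonobo node: cast selected columns before they reach the writer."""
    return {
        name: CASTS[name](value) if name in CASTS and value is not None else value
        for name, value in row.items()
    }

# Usage in a graph would be: graph.add_chain(extract, cast_columns, writer)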

So as I see it (but let me know if I'm wrong, you may have thought about this more), there is one "correct" way, which is the responsibility of whatever talks with the data source, and one possible workaround.

Am I missing something?

Or are you suggesting that you would need some metadata information storage about columns ?

juarezr commented 4 years ago

Thanks for summarizing the implications around my point.

After reading that, I've found some new things about the issues:

juarezr commented 4 years ago

Regarding the metadata storage about columns, I think the importance of preserving the column types depends on the type of output and on the level of effort and complexity of development and use.

So the source column type information will or won't matter according to the use case/scenario.

For instance considering only the output/destination:

  1. It won't matter when transferring data to untyped file formats like CSV, JSON, XML, fixed-width or other text-based output, because it is mandatory to convert the values to string/text.
  2. It won't matter when transferring data from a dbms/database table/query to another dbms/database, because the target table will already have columns with defined types and an implicit type conversion will happen.
  3. It will matter when transferring data to typed file formats like Avro, Parquet, ORC, Protobuf, Thrift or other binary-based output, because it is mandatory to specify the column types.

For instance considering only the effort and complexity of the translation:

The most common ETL use cases I know are:

Basically we have a combination between:

Considering all this, it's possible to make decisions like:

  1. Maintain the current as-is behavior and let the ETL developer specify the types for each transfer, every time.
  2. Try to help the ETL developer by detecting the types, and let them handle the cases where that doesn't fit well.
  3. Create a new transfer plane for exchanging metadata between capable producers and capable consumers.

    One could think of these solutions as follows:

    • 1 could be a valid and rightful decision, as it will avoid complexity and reduce source code size and clutter.
    • 3 would be the ideal scenario, but it could also be very laborious/arduous.
    • 2 could be an interesting approach and even an intermediary step towards 3.
    • 2 may be a necessary measure to handle transfers from non-capable producers to capable consumers.

What do you think? Would any effort toward having this in bonobo be desirable or acceptable?

hartym commented 4 years ago

Hey.

Thanks for the detailed analysis/explanation.

From what I understand, I think all these use cases are already handled by... the Python type system :) Let's say we have int16 and int32 types understood by some consumer. Then the only need is to have some type (as in Python type) that contains enough metadata for the consumer to output the correct type.
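
For illustration, a sketch of what such a metadata-carrying type could look like (hypothetical classes, not part of bonobo or of this PR): a producer yields these instead of bare ints, bonobo passes them through untouched, and any consumer that cares can inspect them.

class Int16(int):
    """An int that remembers it fits in 16 bits; behaves like a normal int otherwise."""
    avro_type = 'int'

class Int64(int):
    avro_type = 'long'

def avro_type_for(value):
    # A consumer only needs a fallback for plain builtins.
    return getattr(type(value), 'avro_type', 'long' if isinstance(value, int) else 'string')

row = {'fsmallint': Int16(123), 'fbigint': Int64(9876543210), 'fname': 'abc'}
print({name: avro_type_for(value) for name, value in row.items()})
# {'fsmallint': 'int', 'fbigint': 'long', 'fname': 'string'}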

There are two things that are not done by bonobo (but I pretty much think it's not its responsibility, although you're welcome to prove me wrong):

Do you have concrete cases that cannot be handled this way?

Also, I think you should focus on Avro-only test cases: if we are able to produce whatever the Avro-related nodes expect, and we ensure said nodes are indeed working correctly, it does not matter what kind of node (or graph) produced the data. Not sure this is something you tried to do in your tests, but as you're describing cases using remote databases I prefer to state it; sorry if it's useless.

Sorry, I still haven't found the time to actually test what your code does, but I'm on it as soon as possible. If you focus the merge request on having Avro readers/writers using an optional dependency (and with tests :p), I think we can integrate it pretty soon.

If you think another topic from this discussion is worth considering, maybe we should open specific issues to discuss it?

Thanks

juarezr commented 4 years ago

Hi,

Regarding focus, I am planning to continue in the following way:

  1. Change the code in the pull request to (see the sketch after this list):
     a. allow the user/dev to directly specify the desired schema
     b. in the absence of a specified schema, try to create one from the Python types as best as one can
  2. Test a bit with other RDBMSs like PostgreSQL, MSSQL and Firebird to catch more issues with the design.
  3. Write code for some test cases simulating basic functionality like extraction from an RDBMS.
  4. Add some documentation to the bonobo docs regarding output to Avro.
  5. Further discuss the type mapping/remapping in another issue as suggested.
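
For item 1, the two call styles the writer would support might look like this (the import path, constructor name and arguments reflect the plan being discussed here, not a settled API):

# Hypothetical import path from this branch:
from bonobo.nodes.io.avro import AvroWriter

# a. The user/dev supplies the exact Avro schema:
writer = AvroWriter('out.avro', schema={
    'name': 'testing', 'type': 'record',
    'fields': [{'name': 'fint', 'type': 'int'}],
})

# b. No schema given: the writer builds one from the Python types of the first row.
writer = AvroWriter('out.avro')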

Also I need some help regarding the best way of:

Summing up the discussion, regarding type mapping I pretty much agree with your conclusions:

What I think is worth exploring in bonobo for the type mapping is a simpler solution like:

  1. Let bonobo and python type system continue handling field's values as is today.
  2. Do not map or convert values besides what producers have already done. This way we avoid losing precision and corrupting data.
  3. Explore ways of extracting original type information from producers that are capable of that, like RDBMS/SQLAlchemy.
  4. Allow writers, like the Avro one, to access this type information for automatically creating their typed fields when needed.

For instance, if I have a table in MySQL created like this:

create table wrong_types (
    ftinyint    TINYINT,
    fsmallint   SMALLINT,
    fmediumint  MEDIUMINT,
    fint        INT,
    fdecimal    DECIMAL(9,2),
    ffloat      FLOAT,
    fbit        BIT
);

Today, without extra type information beyond the Python types, it's only possible to create a schema like:

schema = {
  'fields': [
      {'name': 'ftinyint', 'type': 'long'},
      {'name': 'fsmallint', 'type': 'long'},
      {'name': 'fmediumint', 'type': 'long'},
      {'name': 'fint', 'type': 'long'},
      {'name': 'fdecimal', 'type': 'double'},
      {'name': 'ffloat', 'type': 'double'},
      {'name': 'fbit', 'type': 'bytes'},
  ],
  ...
  }

But knowing the type information, one could create a better, smaller and faster schema like:

schema = {
  'fields': [
      {'name': 'ftinyint', 'type': 'int'},
      {'name': 'fsmallint', 'type': 'int'},
      {'name': 'fmediumint', 'type': 'int'},
      {'name': 'fint', 'type': 'int'},
      {'name': 'fdecimal', 'type': 'bytes', 'logicalType': 'decimal', 'precision': 9, 'scale': 2},
      {'name': 'ffloat', 'type': 'float'},
      {'name': 'fbit', 'type': 'boolean'},
  ],
  ...
  }

The biggest offender there is the mapping of DECIMAL(9,2) and BIT, because it causes loss of precision and incorrect type handling. Most of the other types only cause record/file size increases.

Should we continue this discussion in a separate issue?

lgtm-com[bot] commented 4 years ago

This pull request introduces 3 alerts when merging 145de5ad4a69d916402a8b80538d12df7092a62b into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

lgtm-com[bot] commented 4 years ago

This pull request introduces 3 alerts when merging 165575dca601b6dc7a473555432a76bd13ff0cf2 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

juarezr commented 4 years ago

Hi,

Can you review this pull request?

I think I have reached the point where integration can start.

thanks

lgtm-com[bot] commented 4 years ago

This pull request introduces 2 alerts when merging 3fd0c889aa6899d334de100379be3fe129ee9181 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

juarezr commented 4 years ago

Any news about this PR? Thanks