Add support for reading and writing avro files

juarezr commented 4 years ago

Add support for reading and writing Avro files using FastAvro.

Avro is faster and safer than other format as CSV, JSON or XML.

As Avro is typed, the fields types are detected from values. Once bonobo starts preserving types, they could be used for determining field types.

Tested with the workflow mysql -> sqlalchemy -> bonobo -> avro.

Publishing now for gattering sugestions.

lgtm-com[bot] commented 4 years ago

This pull request introduces 2 alerts when merging 078fcccc973c01a2ab9fb510fb6121e7e8f5af7c into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

1 for Signature mismatch in overriding method
1 for Unused local variable

lgtm-com[bot] commented 4 years ago

This pull request introduces 1 alert when merging d62465adef183228583becfca5e5d20c2b849362 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

1 for Signature mismatch in overriding method

hartym commented 4 years ago

Hello,

This should be an optional dependency, I don't think 90% of users want to install this dep ifthey are not going to use it.

Thanks a lot for your contribution, what's the status of the code ? Should I test that already ?

lgtm-com[bot] commented 4 years ago

This pull request introduces 1 alert when merging adc443173c3076408823d3c08bd381ad7e995e96 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

1 for Signature mismatch in overriding method

juarezr commented 4 years ago

Regarding status of code:

Code is working
- The reading and writing of avro files is working fine
- Tested with a pipeline querying a remote mysql database and writing to avro files
- It has 2 tests for basic reading and writing
- Missing user documentation and more test cases
Some guidance would be appreciated on:
- Avro columns/fields are typed. But I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo didn't have this feature.
- A case for type preserving is data transfer between RDBMS servers.
- For solving this the code has two strategies:
- a fields property for specifying the type of the fields
- automatic type dectection based on first row/record

Regarding optional dependency:

I agree that avro support should be optional
The strategy that I was thinking is:
- if you don't use AvroReader/AvroWriter you are not affected
- if you use, you should also pip install fastavro. Otherwise it would fail
Now I am fighting with the CI tests on github missing the fastavro lib.

If you want testing, just pip install fastavro in a environment and replace a CvsWriter by a AvroWriter.

What you think about?

Thanks

hartym commented 4 years ago

Avro columns/fields are typed. But I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo didn't have this feature.

Bonobo uses python's type system, which does not allow implicit conversions.

For solving this the code has two strategies: a fields property for specifying the type of the fields automatic type dectection based on first row/record

Not sure what "code" you're talking about, this is not far from what bonobo does (see NodeExecutionContext).

It seems also that the tests should do a bit more, you need to "assert" things so pytest will actually check something. I'll have a try as soon as possible.

Thanks !

juarezr commented 4 years ago

Avro columns/fields are typed. But I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo didn't have this feature.

Bonobo uses python's type system, which does not allow implicit conversions.

Let me explain in more details what I tried to tackle.

For writing to a avro file one should define a schema like this:

schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'day', 'type': 'int', 'logicalType': 'date'},
        {'name': 'time', 'type': 'long', 'logicalType': 'time-micros'},
        {'name': 'temp', 'type': 'int'},
        {'name': 'umidity', 'type': 'bytes', 'logicalType': 'decimal', 'precision': 4, 'scale': 2},
    ],
}

The workflow that I was trying to hadle is the most common in ETL processing:

Querying a database in a RDBMS with a query like select * from mytable
Doing some transformation in the rows of data: a. adding, converting, joining and splitting fields from a row b. joining and splitting a row in a one-to-many transformation c. joining data from other flow/file/query into the row (like descriptions of a key)
Writing the rows/records to the avro file with the exact or aproximated field types

However, in the step 3, there wasn't type information for creating the required type schema.

The problem happens because the python's type system could be not enough to represent the richness of types of a RDMS database:

Using python float (double) for handling values like money, or units of measure could lead to roundings and imprecisions that affect the values.
The value 0.1 (10%) and several others does not exists in floating point representation.
Any reasonable person will complain about wrong values in their pay check, for example.
The issue does not show up with CSV because everything is converted to ascii text.
But formats like Avro, Parquet, ORC, Protobuf, etc... were designed for exchanging field types because developers missed/suffered with the lost of information and performance of CSV.

Using the current values types give by bonobo I got working the following types:

string
float (double, 64bits)
int (int64)
bytes (blob)

But I suffered with the following types:

integer types: int8, int16, int32 as storage size matters
half float: float32 promoted to float64
exact types: decimal(prec,dec), numeric(prec,dec), money promoted to float64
date, time, datetime and timezone: worked with some adaptation
bit, boolean: got mapped to int

I tried to solve this by two ways:

Creating a property in AvroWriter for the developer specifing a list of types
Trying to autodetect the types from first row return by bonobo. Here I suffered from reduced type information.

I'm looking for the best way to handle this issue.

juarezr commented 4 years ago

It seems also that the tests should do a bit more, you need to "assert" things so pytest will actually check something.

I agree with that. I just would like to settle the design before committing time writing more tests.

Thanks

juarezr commented 4 years ago

For reproducing the issue I did the following:

created a database in remotemysql.com
executed the script bellow for creating a table with some recores
create a graph in bonobo using: a. connected bonobo in the remote database with SqlAlchemy b. executed a query select * from testing in a Bonobo graph node c. used AvroWriter for writing for a avro file
used debugger to inspect the resulting types

Notice that most types mapped to int, float, string, bytes, and datetime.datetime


create table testing (
    fboolean    BOOLEAN,
    fchar       CHAR(10),
    fvarchar    VARCHAR(10),
    ftext       TEXT(10),
    ftinyint    TINYINT,
    fsmallint   SMALLINT,
    fmediumint  MEDIUMINT,
    fint        INT,
    fbigint     BIGINT,
    fdecimal    DECIMAL(9,2),
    ffloat      FLOAT,
    fdouble     DOUBLE,
    fbit        BIT,
    fdate       DATE,
    ftime       TIME,
    fdatetime   DATETIME,
    ftimestamp  TIMESTAMP,
    fyear       YEAR
);

insert into testing values(
    TRUE, 
    'abcdef', 'ghijkl', 'mnopqr', 
    1, 123, 32000, 66000, 1234567890,
    123.456, 456.789, 123.789, 1, 
    '2019-12-25', '21:22:23', '2019-12-25 21:22:23', '2019-10-25 17:22:23', 
    2019
);

insert into testing values(
    false, 
    'vidi', 'vini', 'vinci', 
    2, 121, 32023, 66066, 9876543210,
    234.567, 567.890, 234.890, 0, 
    '2019-12-15', '15:22:23', '2019-12-15 16:22:23', '2019-10-15 17:15:23', 
    2018
);

hartym commented 4 years ago

I understand your point.

It should be possible to use different types than builtins, like for example one could use decimals (https://docs.python.org/3/library/decimal.html) to avoid wrong numbers on payckeck or numpy types to have rightly sized variables. There are two ways to do so and I think (but I may be wrong) it's not bonobo job to handle this (or at least, not more than providing a way to let the user do it.

Either your data producer already knows how to produce those types as an output (a db driver that would yield numpy integers, for example). In that case, job's already done, and bonobo will just pass those values through. Either your data producer produces other types (assuming they do not contain unwantable approximations) and you can have a node in charge of casting things. This is of course less effective, but may still work in certain situations as it will free up memory waste for further processing, and there should be a limited amount of rows waiting to be converted. This is already something you can do in a node.

So as I see it (but let me know if I'm wrong, you may have thought more of this), there is one "correct" way which is the responsibility of whatever talks with the data source, and one workaround which is possible.

Am I missing something ?

Or are you suggesting that you would need some metadata information storage about columns ?

juarezr commented 4 years ago

Thanks for the summarizing the implications around the my point.

After reading that, I've found some new things about the issues:

The issue with precision loss could be a because the fact that the DBAPI returns floats and not decimals is a database/driver/DBAPI issue. But I need further investigate with other MySql drivers, Postgresql, etc...
SQLAlchemy provides abstractions for most common database data types, and a mechanism for specifying your own custom data types.

juarezr commented 4 years ago

Regarding the metadata information storage about columns, I am thinking that the importance of preserving the column types depends on the type of output and the level of effort and complexity of the development and use.

So the source column type information will/wont matter according the use case/scenario.

For instance considering only the output/destination:

It wont matter for transferring data to untyped file formats like CSV, Json, XML, fixed or other text based output, because the is mandatory to convert the values to string/text.
It wont matter for transferring data from a dbms/database table/query to another dbms/database because the target table will already have the columns with defined types and a implicit type conversion will happen.
It will matter for transferring data to typed file formats like Avro, Parquet, ORC, Protobuf, thrift or other binary based output, because the is mandatory to specify the column types.

For instance considering only the effort and complexity of the translation:

For 1 and 2, it's not worth investing time and effort because the current behavior already suits well.
For the 3 I think the handling will depend in how one decide in balancing the effort needed for developing and the burden/benefits of solving automatically between:
- bonobo library and bonobo developer
- ETL developer/bonobo user

The most common ETL use cases I now are:

Transfer some database table/query rows into files like CSV or typed format.
Transfer the exported file into another database table.
Transfer some database table/query rows directly into another database table.

Basically we have a combination between:

non capable producers vs non capable consumers
capable producers vs non capable consumers
capable producers vs capable consumers
non capable producers vs capable consumers

Considering all this it's possible to have some decisions like:

Maintain the current as-is behavior and let the ETL developer specify for each transfer the types all times.
Try to help the ETL developer detecting the types and let him handle when it not fits well.
Create a new transfer plane for exchanging metadata between capable producers and capable consumers.

One could think that this solutions:
- 1 could be a valid and righfull decision as it will avoid complexity and reduce source code size anbd cluttering.
- 3 would be a ideal scenario, but it could also be very laborius/ardous.
- 2 could be a interesting approach and even an intermediary step to 3.
- 2 may be a necessary measure to handle the transfer from no capable producers and capable consumers.

What you think about? Would be desired or acceptable any effort regarding having this in bonobo?

hartym commented 4 years ago

Hey.

Thanks for the detailed analysis/explanation.

From what I understand, I think that all use cases are already handled by ... python type system :) Let's say we have int16 and int32 types understood by some consumer. Then the only need is to have some type (as in python type) that contains enough metadata for the consumer to output the correct type.

There are two things that are not done by bonobo (but I pretty much think it's not its responsibility, although you're welcome to prove me wrong) :

Mapping types or objects from or into target domain (understanding an int16 is a 16 bits integer in the avro format for example (which is purely conceptual, I don't know nothing about avro format and capabilities). This is hardly doable in bonobo, because it would pretty much require knowledge about all possible target domains, which is unrealistic and not wanted anyway (would be too much generic and real world usage would most probably require rewriting it anyway). An example of vanilla-python library that does this is for example the json library (or is it simplejson ? I always forget the name). You get the mappings for most simple types builtin, but if you want to convert decimals, timestamps, etc. into json, you need to write your own converters. Which is fine as depending on your application, serializing a timestamp to json may mean a lot of different things that python cannot guess for you. Another example (on the producer side) is the one you describe above : well thought libraries like sqlalchemy allows you to map things to whatever types you need.
Ensuring a column's value is always of a given type at a given stage of a pipeline. This is something that bonobo does not enforce at the graph level but you're still allowed to do it pretty easily : you can have a kind of "identity" transformation (one that yields NOT_MODIFIED everytime) that adds a validation step (so in fact, it yields NOT_MODIFIED if your constraints are respected, for example type constraints, and raise an error otherwise). This should be trivial to implement and I'm not sure a generic tool for that would be useful, but it may be (a kind of node factory that takes a sort of stream schema and does the "if" stuff, tricky part for a generic version is "what is a schema?").

Do you have concrete cases that cannot be handled this way?

Also, I think you should focus on avro-only test cases, as if we are able to produce whatever the avro-related nodes expect and we ensure the said nodes are indeed working correctly, it does not matter to know what kind of node (or graph) produced the data. Not sure this is something you tried to do in your tests but as you're describing cases using remote databases, I prefer to state this, sorry if it's useless.

Sorry I still did not find the time to actually test what your code does but I'm on it as soon as possible, if you focus the merge request on having avro readers/writers using optional dependency (and with tests :p), I think we can integrate it pretty soon.

If you think that from the discussion another topic is worth considering, maybe we should open specific issues to discuss it ?

Thanks

juarezr commented 4 years ago

Hi,

Regarding focus I planning to continue in the following way:

Change the code in pull request for: a. Allowing the user/dev directly specify the desired schema. b. In absense of a specified schema try to create one from python types as best as one can.
Test a little bit with other RDBMS like Postgresql, MSSQL, Firebird for catching more issues with the design.
Write code for some tests cases simulating basic functionality like extraction from RDBMS.
Add some documentation to bonobo docs regarding output to Avro.
Further discuss the types mapping/remapping in another issue as sugested.

Also I need some help regarding the best way of:

Figuring where to place documentation for AvroWriter/AvroReader.
If the code follow guidelines.
If the are any mistake that I should avoid.
If the are any form that perform faster or is better suited for maintaining.

Resuming the discussion, regarding type mapping I pretty much agree with your conclusions:

Ensuring a column's value is always of a given type at a given stage of a pipeline probably is unnecessary and overkill.
Also mapping types or objects from a source domain probably is unnecessary and overkill.
Mapping types or objects into target domain is what I need to write Avro files with a defined schema, but for outputs like CSV or Json are meaningless.

What I'm think is worth exploring in bonobo for the type mapping is a simpler solution like:

Let bonobo and python type system continue handling field's values as is today.
Do not map or convert values besides what producers have already done. This way we avoid losing precision and corrupting data.
Explore ways of extracting original type information from producers that are capable of that, like RDBMS/SQLAlchemy.
Allow writers, like Avro, access this type information for automatically creating their typed fields when needed.

For instance, if I have a table in MySql created like this:

create table wrong_types (
    ftinyint    TINYINT,
    fsmallint   SMALLINT,
    fmediumint  MEDIUMINT,
    fint        INT,
    fdecimal    DECIMAL(9,2),
    ffloat      FLOAT,
    fbit        BIT
);

Today without extra type information besides python types it's only possible to create a schema like:

schema = {
  'fields': [
      {'name': 'ftinyint', 'type': 'long'},
      {'name': 'fsmallint', 'type': 'long'},
      {'name': 'fmediumint', 'type': 'long'},
      {'name': 'fint', 'type': 'long'},
      {'name': 'fdecimal', 'type': 'double'},
      {'name': 'ffloat', 'type': 'double'},
      {'name': 'fbit', 'type': 'bytes'},
  ],
  ...
  }

But knowing the type information one could create a better, smaller and faster schema like:

schema = {
  'fields': [
      {'name': 'ftinyint', 'type': 'int'},
      {'name': 'fsmallint', 'type': 'int'},
      {'name': 'fmediumint', 'type': 'int'},
      {'name': 'fint', 'type': 'int'},
      {'name': 'fdecimal', 'type': 'bytes', 'logicalType': 'decimal', 'precision': 9, 'scale': 2},
      {'name': 'ffloat', 'type': 'float'},
      {'name': 'fbit', 'type': 'boolean'},
  ],
  ...
  }

The biggest offensor there is the mapping of DECIMAL(9,2) and BIT because it causes loss of precision and incorrect type handling. Most of other types causes only record/file size increases.

Should we continue this discussion in a separated issue?

lgtm-com[bot] commented 4 years ago

This pull request introduces 3 alerts when merging 145de5ad4a69d916402a8b80538d12df7092a62b into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

1 for Signature mismatch in overriding method
1 for Module imports itself
1 for Unused local variable

lgtm-com[bot] commented 4 years ago

This pull request introduces 3 alerts when merging 165575dca601b6dc7a473555432a76bd13ff0cf2 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

1 for Signature mismatch in overriding method
1 for Module imports itself
1 for Unused local variable

juarezr commented 4 years ago

Hi,

Can you review this pull request?

I think that I reached the point for starting the integration.

thanks

lgtm-com[bot] commented 4 years ago

This pull request introduces 2 alerts when merging 3fd0c889aa6899d334de100379be3fe129ee9181 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com

new alerts:

1 for Signature mismatch in overriding method
1 for Module imports itself

juarezr commented 4 years ago

Any news about this PR? Thanks

python-bonobo / bonobo

Add support for reading and writing avro files #367