juarezr opened this pull request 4 years ago
This pull request introduces 2 alerts when merging 078fcccc973c01a2ab9fb510fb6121e7e8f5af7c into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com
new alerts:
This pull request introduces 1 alert when merging d62465adef183228583becfca5e5d20c2b849362 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com
new alerts:
Hello,
This should be an optional dependency; I don't think 90% of users want to install this dep if they are not going to use it.
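One common way to do that is to guard the import, so only people actually using the Avro nodes need the package installed (just a sketch, with a made-up module path, not necessarily how this PR should do it):

# bonobo/nodes/io/avro.py (made-up path): import fastavro lazily so users who
# never touch the Avro nodes don't need it installed.
try:
    import fastavro
except ImportError:  # fastavro stays optional
    fastavro = None

class AvroWriter:
    # Sketch of the guard only; the real writer in the branch does the actual work.
    def __init__(self, path, **options):
        if fastavro is None:
            raise ImportError('AvroWriter needs the optional fastavro package: pip install fastavro')
        self.path = path
        self.options = options

# setup.py could additionally declare it as an extra, e.g. extras_require={'avro': ['fastavro']},
# so it installs with: pip install bonobo[avro]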
Thanks a lot for your contribution. What's the status of the code? Should I test it already?
This pull request introduces 1 alert when merging adc443173c3076408823d3c08bd381ad7e995e96 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com
new alerts:
Regarding the status of the code:
Regarding the optional dependency: one has to pip install fastavro; otherwise it would fail.
If you want to test it, just pip install fastavro in an environment and replace a CsvWriter with an AvroWriter, as in the sketch below.
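For instance, something along these lines (a sketch: the extracted rows are made up, and it assumes the branch exposes AvroWriter next to the builtin writers):

import bonobo
from bonobo import AvroWriter  # assuming this branch exposes it like CsvWriter

def extract():
    # made-up rows, just to exercise the writer
    yield {'station': 'wp-001', 'temp': 22}
    yield {'station': 'wp-002', 'temp': 19}

graph = bonobo.Graph(
    extract,
    # bonobo.CsvWriter('weather.csv'),  # the usual writer...
    AvroWriter('weather.avro'),         # ...swapped for the one in this branch
)

if __name__ == '__main__':
    bonobo.run(graph)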
What do you think?
Thanks
Avro columns/fields are typed. But I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo didn't have this feature.
Bonobo uses python's type system, which does not allow implicit conversions.
For solving this, the code has two strategies: a fields property for specifying the type of the fields, and automatic type detection based on the first row/record.
Not sure what "code" you're talking about, this is not far from what bonobo does (see NodeExecutionContext).
It seems also that the tests should do a bit more, you need to "assert" things so pytest will actually check something. I'll have a try as soon as possible.
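For instance, whatever the writer produces, a test can read it back and assert on the decoded records; with fastavro alone the shape would be something like this (a sketch, not code from this PR):

import io
from fastavro import parse_schema, reader, writer

def test_avro_roundtrip():
    schema = parse_schema({
        'name': 'Weather', 'namespace': 'test', 'type': 'record',
        'fields': [
            {'name': 'station', 'type': 'string'},
            {'name': 'temp', 'type': 'int'},
        ],
    })
    records = [{'station': 'wp-001', 'temp': 22}, {'station': 'wp-002', 'temp': 19}]

    buffer = io.BytesIO()
    writer(buffer, schema, records)
    buffer.seek(0)

    # the assert is what makes pytest actually check something
    assert list(reader(buffer)) == records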
Thanks !
Avro columns/fields are typed. But I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo didn't have this feature.
Bonobo uses python's type system, which does not allow implicit conversions.
Let me explain in more detail what I tried to tackle.
For writing to an Avro file, one should define a schema like this:
schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'day', 'type': {'type': 'int', 'logicalType': 'date'}},
        {'name': 'time', 'type': {'type': 'long', 'logicalType': 'time-micros'}},
        {'name': 'temp', 'type': 'int'},
        {'name': 'humidity', 'type': {'type': 'bytes', 'logicalType': 'decimal', 'precision': 4, 'scale': 2}},
    ],
}
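Then the records must carry Python values matching those types, roughly like this (a sketch with made-up values; fastavro converts the date/time/Decimal objects according to the logical types):

from datetime import date, time
from decimal import Decimal
from fastavro import parse_schema, writer

parsed_schema = parse_schema(schema)  # the schema dict defined just above

records = [
    {
        'station': '011990-99999',
        'day': date(2019, 12, 25),       # logicalType date -> datetime.date
        'time': time(21, 22, 23),        # logicalType time-micros -> datetime.time
        'temp': 22,
        'humidity': Decimal('98.50'),    # logicalType decimal (precision 4, scale 2)
    },
]

with open('weather.avro', 'wb') as out:
    writer(out, parsed_schema, records)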
The workflow that I was trying to handle is the most common in ETL processing:
1. query the source database with something like select * from mytable
2. process the rows through a bonobo graph
3. write the result to an Avro file
However, at step 3 there was no type information available for creating the required schema.
The problem happens because Python's type system may not be enough to represent the richness of types in an RDBMS database:
Using the current value types given by bonobo, I got the following types working:
But I struggled with the following types:
I tried to solve this in two ways:
I'm looking for the best way to handle this issue.
It seems also that the tests should do a bit more, you need to "assert" things so pytest will actually check something.
I agree with that. I just would like to settle the design before committing time to writing more tests.
Thanks
For reproducing the issue I did the following:
a. created the table below in a MySQL database and inserted the sample rows
b. ran select * from testing in a Bonobo graph node
c. used AvroWriter for writing to an Avro file
Notice that most types mapped to int, float, string, bytes, and datetime.datetime (a sketch of how these map to Avro types follows the SQL below).
create table testing (
fboolean BOOLEAN,
fchar CHAR(10),
fvarchar VARCHAR(10),
ftext TEXT(10),
ftinyint TINYINT,
fsmallint SMALLINT,
fmediumint MEDIUMINT,
fint INT,
fbigint BIGINT,
fdecimal DECIMAL(9,2),
ffloat FLOAT,
fdouble DOUBLE,
fbit BIT,
fdate DATE,
ftime TIME,
fdatetime DATETIME,
ftimestamp TIMESTAMP,
fyear YEAR
);
insert into testing values(
TRUE,
'abcdef', 'ghijkl', 'mnopqr',
1, 123, 32000, 66000, 1234567890,
123.456, 456.789, 123.789, 1,
'2019-12-25', '21:22:23', '2019-12-25 21:22:23', '2019-10-25 17:22:23',
2019
);
insert into testing values(
false,
'vidi', 'vini', 'vinci',
2, 121, 32023, 66066, 9876543210,
234.567, 567.890, 234.890, 0,
'2019-12-15', '15:22:23', '2019-12-15 16:22:23', '2019-10-15 17:15:23',
2018
);
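To make the automatic detection strategy concrete, the idea boils down to mapping the Python types of the first row to Avro types, roughly like this (a simplified sketch, not the exact code in this branch):

from datetime import date, datetime

# simplified mapping from the Python types bonobo hands over to Avro types;
# anything unknown falls back to string
PYTHON_TO_AVRO = {
    bool: 'boolean',
    int: 'long',
    float: 'double',
    str: 'string',
    bytes: 'bytes',
    date: {'type': 'int', 'logicalType': 'date'},
    datetime: {'type': 'long', 'logicalType': 'timestamp-micros'},
}

def guess_fields(first_row):
    # build the 'fields' part of an Avro schema from the first row (a dict)
    return [
        {'name': name, 'type': PYTHON_TO_AVRO.get(type(value), 'string')}
        for name, value in first_row.items()
    ]

# with the kinds of values the testing table above yields:
print(guess_fields({'fint': 66000, 'ffloat': 456.789, 'fchar': 'abcdef', 'fdate': date(2019, 12, 25)}))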
I understand your point.
It should be possible to use different types than builtins; for example, one could use decimals (https://docs.python.org/3/library/decimal.html) to avoid wrong numbers on paychecks, or numpy types to have rightly sized variables. There are two ways to do so, and I think (but I may be wrong) it's not bonobo's job to handle this (or at least, not more than providing a way to let the user do it).
Either your data producer already knows how to produce those types as an output (a db driver that would yield numpy integers, for example); in that case the job's already done, and bonobo will just pass those values through. Or your data producer produces other types (assuming they do not contain unwanted approximations) and you can have a node in charge of casting things. This is of course less effective, but may still work in certain situations, as it will free up wasted memory for further processing and there should be a limited amount of rows waiting to be converted. This is already something you can do in a node.
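For illustration, such a casting node is just another callable in the graph (a rough sketch with made-up field names):

from decimal import Decimal

def cast_row(row: dict) -> dict:
    # upgrade lossy builtins into richer types before they reach the writer
    return {
        **row,
        'fdecimal': Decimal(str(row['fdecimal'])),  # keep money-like values exact
        'fbit': bool(row['fbit']),                  # re-interpret a 0/1 column as a boolean
    }

# plugged in between producer and writer: bonobo.Graph(extract_from_db, cast_row, writer)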
So as I see it (but let me know if I'm wrong, you may have thought more about this), there is one "correct" way, which is the responsibility of whatever talks with the data source, and one possible workaround.
Am I missing something ?
Or are you suggesting that you would need some metadata information storage about columns ?
Thanks for summarizing the implications around my point.
After reading that, I've found some new things about the issues:
Regarding the metadata storage about columns, I think that the importance of preserving the column types depends on the type of output and on the level of effort and complexity of development and use.
So the source column type information will or won't matter depending on the use case/scenario.
For instance, considering only the output/destination:
For instance, considering only the effort and complexity of the translation:
For 1 and 2, it's not worth investing time and effort because the current behavior already suits them well.
For 3, I think the handling will depend on how one decides to balance the effort needed for development against the burden/benefits of solving it automatically, between:
The most common ETL use cases I know are:
Basically we have a combination between:
Considering all this, it's possible to make some decisions, like:
Create a new transfer plane for exchanging metadata between capable producers and capable consumers.
One could think that these solutions work out as follows:
1 could be a valid and rightful decision, as it will avoid complexity and reduce source code size and clutter.
3 would be an ideal scenario, but it could also be very laborious/arduous.
2 could be an interesting approach and even an intermediary step towards 3.
2 may be a necessary measure to handle the transfer from non-capable producers to capable consumers.
What do you think? Would any effort towards having this in bonobo be desired or acceptable?
Hey.
Thanks for the detailed analysis/explanation.
From what I understand, I think that all use cases are already handled by ... python type system :) Let's say we have int16 and int32 types understood by some consumer. Then the only need is to have some type (as in python type) that contains enough metadata for the consumer to output the correct type.
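For example (a sketch, with numpy standing in for "types understood by some consumer"; none of this is existing bonobo or PR code):

import numpy as np

def produce():
    # the producer yields values whose Python type already carries the width information
    yield {'small': np.int16(123), 'big': np.int32(66000)}

def avro_type_for(value):
    # the consumer only has to look at the type to pick the matching Avro type
    if isinstance(value, (np.int16, np.int32)):
        return 'int'    # Avro int is 32 bits
    if isinstance(value, (np.int64, int)):
        return 'long'   # Avro long is 64 bits
    return 'string'

row = next(produce())
print({name: avro_type_for(value) for name, value in row.items()})  # {'small': 'int', 'big': 'int'}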
There are two things that are not done by bonobo (but I pretty much think it's not its responsibility, although you're welcome to prove me wrong):
Do you have concrete cases that cannot be handled this way?
Also, I think you should focus on Avro-only test cases: if we are able to produce whatever the avro-related nodes expect, and we ensure said nodes are indeed working correctly, it does not matter what kind of node (or graph) produced the data. Not sure this is something you tried to do in your tests, but as you're describing cases using remote databases, I prefer to state this; sorry if it's useless.
Sorry, I still did not find the time to actually test what your code does, but I'm on it as soon as possible. If you focus the merge request on having Avro readers/writers using an optional dependency (and with tests :p), I think we can integrate it pretty soon.
If you think another topic from the discussion is worth considering, maybe we should open specific issues to discuss it?
Thanks
Hi,
Regarding focus, I'm planning to continue in the following way:
Also, I need some help regarding the best way of:
Resuming the discussion, regarding type mapping I pretty much agree with your conclusions:
Mapping from a source domain is probably unnecessary and overkill.
Mapping into a target domain is what I need to write Avro files with a defined schema, but for outputs like CSV or JSON it is meaningless.
What I think is worth exploring in bonobo for the type mapping is a simpler solution, sketched at the end of this comment.
For instance, if I have a table in MySql created like this:
create table wrong_types (
ftinyint TINYINT,
fsmallint SMALLINT,
fmediumint MEDIUMINT,
fint INT,
fdecimal DECIMAL(9,2),
ffloat FLOAT,
fbit BIT
);
Today, without extra type information besides Python types, it's only possible to create a schema like:
schema = {
'fields': [
{'name': 'ftinyint', 'type': 'long'},
{'name': 'fsmallint', 'type': 'long'},
{'name': 'fmediumint', 'type': 'long'},
{'name': 'fint', 'type': 'long'},
{'name': 'fdecimal', 'type': 'double'},
{'name': 'ffloat', 'type': 'double'},
{'name': 'fbit', 'type': 'bytes'},
],
...
}
But knowing the type information one could create a better, smaller and faster schema like:
schema = {
    'fields': [
        {'name': 'ftinyint', 'type': 'int'},
        {'name': 'fsmallint', 'type': 'int'},
        {'name': 'fmediumint', 'type': 'int'},
        {'name': 'fint', 'type': 'int'},
        {'name': 'fdecimal', 'type': {'type': 'bytes', 'logicalType': 'decimal', 'precision': 9, 'scale': 2}},
        {'name': 'ffloat', 'type': 'float'},
        {'name': 'fbit', 'type': 'boolean'},
    ],
    ...
}
The biggest offenders there are the mappings of DECIMAL(9,2) and BIT, because they cause loss of precision and incorrect type handling. Most of the other types only cause the record/file size to increase.
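What I would like to end up with is being able to hand that better schema (or just the per-field overrides) to the writer, along these lines (the fields option here is hypothetical; it matches the fields property idea mentioned earlier, but it is not what the branch exposes today):

from decimal import Decimal

import bonobo
from bonobo import AvroWriter  # assuming the branch exposes it like CsvWriter

def extract_wrong_types():
    # stand-in for the select on the wrong_types table above
    yield {'ftinyint': 1, 'fint': 66000, 'fdecimal': Decimal('123.45'), 'fbit': 1}

graph = bonobo.Graph(
    extract_wrong_types,
    AvroWriter(
        'wrong_types.avro',
        # hypothetical 'fields' option: only override what automatic detection maps badly
        fields={
            'fdecimal': {'type': 'bytes', 'logicalType': 'decimal', 'precision': 9, 'scale': 2},
            'fbit': 'boolean',
        },
    ),
)

if __name__ == '__main__':
    bonobo.run(graph)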
Should we continue this discussion in a separate issue?
This pull request introduces 3 alerts when merging 145de5ad4a69d916402a8b80538d12df7092a62b into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com
new alerts:
This pull request introduces 3 alerts when merging 165575dca601b6dc7a473555432a76bd13ff0cf2 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com
new alerts:
Hi,
Can you review this pull request?
I think I've reached the point for starting the integration.
thanks
This pull request introduces 2 alerts when merging 3fd0c889aa6899d334de100379be3fe129ee9181 into 274058db8b8d972e09d2f02942b4f52c009b4e8a - view on LGTM.com
new alerts:
Any news about this PR? Thanks
Add support for reading and writing Avro files using FastAvro.
Avro is faster and safer than other formats such as CSV, JSON or XML.
As Avro is typed, the field types are detected from the values. Once bonobo starts preserving types, they could be used for determining the field types.
Tested with the workflow mysql -> sqlalchemy -> bonobo -> avro, roughly as sketched below.
Publishing now to gather suggestions.
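Roughly, the tested graph looks like this (a sketch: the connection string and dict conversion are illustrative, and it assumes AvroWriter is exposed like the other builtin writers):

import bonobo
import sqlalchemy
from bonobo import AvroWriter  # assuming this branch exposes it like CsvWriter

def extract_testing(engine_url='mysql://user:password@localhost/test'):
    # yield the rows of the testing table as plain dicts
    # (SQLAlchemy 1.3 style; newer versions need text() and row._mapping)
    engine = sqlalchemy.create_engine(engine_url)
    with engine.connect() as connection:
        for row in connection.execute('select * from testing'):
            yield dict(row)

graph = bonobo.Graph(
    extract_testing,
    AvroWriter('testing.avro'),
)

if __name__ == '__main__':
    bonobo.run(graph)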