orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0
4.75k stars 871 forks source link

Loading CSV data by ETL override Float values #6839

Closed ghost closed 7 years ago

ghost commented 8 years ago

OrientDB Version, operating system, or hardware.

Load CSV file with 0 in first line (below title-row) and on next line, i.e. below the column with 0, a float value, e.g. 2.82942 will become 2

robfrank commented 8 years ago

If you know the types of the columns, you can configure the CSV extractor: http://orientdb.com/docs/last/Extractor.html#csv-extractor

{ "csv": 
    {  "predefinedFormat": "DEFAULT",
       "nullValue" : "N/A",
       "dateFormat" : "dd-MM-yyyy",
       "dateTimeFormat" : "dd-MM-yyyy HH:mm",
       "columns": ["name:string","createdAt:date","updatedAt:dateTime","value:long"]
    }
}
ghost commented 8 years ago

Thanks robfrank, you are absolutely right! However, there is a snag here; the nature of the data makes it impractical to specify each column and add the type. Setting up ETL jobs in such detail is not practical in the use cases here. In my mind the ETL process should auto-detect float and perhaps revert that property to float. Thanks anyway!

robfrank commented 8 years ago

I setup a test for this case, configuration is:

{source: { content: { value: 'firstNumber
10
"10,78"'}  }, extractor : { csv: {} }, loader: { test: {} } }

Which represents a csv like that:

firstNumber
10
10.78

So, the first line is "casted" to integer, the second to float. I guess you have configured ETL to create class properties setting dbAutoCreateProperties on the loader to true. So, it will map the first line to integer and will create the property in the DB as integer. Then on the second line the CSV extractor parse 10.78 to float, but it is stored in the db as integer.

Your suggestion could be dangerous. Suppose you have multiple lines, some lines with decimals and some without decimals? What is the intention of the user? Should the parser read all as integers? Or as floats? Or maybe the user want these values to be parsed as String.

The ETL tries to guess types in the best effort way, but if you need precision and you know your data, configuring columns is be the best way to do it. It gives more control to you with dirty data sets.

ETL in general is a dirty work :)

ghost commented 8 years ago

Sorry for my late response, but I beg to differ. The suggestion is not dangerous. We talk about column by column data. If someone mixes integers, strings and floats in the very same column, i.e. property, he or she got a problem. That is not what I am talking about. Well, yes ETL is dirty, by working with instruments(Mass Cytometry) in cancer research, which produces millions of data points, this approach could make a difference. Please also take a look at my dryrun suggestion in #6872

robfrank commented 8 years ago

Configuring ETL with data types will speed-up the ingestion process because ETL doesn't need to guess the data type for every line. Changing data types on the go implies a priority schema for types: is float more important than integer? Maybe for your use case. Maybe for another user integer should be preferred.

robfrank commented 8 years ago

Configuring ETL with data types will speed-up the ingestion process because ETL doesn't need to guess the data type for every line. Changing data types on the go implies a priority schema for types: is float more important than integer? Maybe for your use case. Maybe for another user integer should be preferred.

ghost commented 8 years ago

Again I disagree. OrientDB is marketed as a NoSQL database, schema less, schema full, hybrid, what have you. That is actually the beauty of it, great flexibility. You are voicing in an opposite direction. Most scientific data will contain different number of parameters and types, depending on the nature of the study/project in question. Hence manually (or auto generating) ETL/JSON file in such detail before each ETL job will be impractical/expensive and error prone.

I find the talk about priority hard to understand. If one measurement, from one property(column) is 0, and the next row, same column/property, is 1.274284, priority is irrelevant. That property must be set to float, not integer as it is today. Current default behavior is a bug.

You also argued that ETL is dirty work. Well, it doesn't have to be that way. You probably don't want to, but effectively you make things harder. There is no way anyone will enter scientific data from the keyboard, hence ETL is a critical path to success for OrientDB, or any database for that matter. Bioinformatics is definitely on the rise and the volumes do not get smaller. Please reconsider.

santo-it commented 7 years ago

Hi @austx,

Many thanks for opening this issue and for your feedback

I have discussed internally with @robfrank and we are considering to support the following additional case:

This will handle the case of the following situation: a column (property) where you have different data types, e.g.

0 2.8 "hello"

will be imported in OrientDB in schema-less mode, with no errors, thus giving the users lot of flexibility

In your specific case, you will have some integers and some decimal values (and no cast will happen), e.g. you fill find in the database the following values:

0 2.82942

Note that after the ETL is complete, you can still, if you want, create a DECIMAL property on this column

I believe that with this additional case we are considering to handle we will give more flexibility to our users. Obviously, one may still decide to define the data type before the import, and a cast will be done

Warning: when using schema-less with no checks on the data type, users must know what they are doing (as they may end up with having strings, integers, decimals and other types on the same property. Sometime they may want right this - and some other times lack of schema will prevent the user to find that that "string" is a wrong value. I believe this is kind of compromise we have to accept to have maximum flexibility)

Will it be a good compromise for you?

Many thanks,

ghost commented 7 years ago

Hi santo-it et al.,

thanks for getting back to me. I use schema less when loading new data from mass cytometry. I have included an example(converted to csv). I use a shell script when preparing data for ETL, so this is the raw version before my prep. The prep does several operations, one is substituting 0 by 0.0. Please have a go at the included file and consider the options.

I could appreciate the fact that in some odd cases strings, e.g. "hello" would end up in a number column. Personally I believe this should ring an alarm, but then again, there is a limit to what OrientDB/ETL should be required to perform.

Also, when it comes to bad data, the debug flag ERROR could perhaps be a bit more verbose, what about switching on some of the DEBUG features when an error in the data is detected?

Thanks for your understanding, very much appreciated.

On 21 November 2016 at 12:04, santo-it notifications@github.com wrote:

Hi @austx https://github.com/austx,

Many thanks for opening this issue and for your feedback

I have discussed internally with @robfrank https://github.com/robfrank and we are considering to support the following additional case:

  • schema less (so no properties defined)
  • no cast (the values will be imported as they are)

This will handle the case of the following situation: a column (property) where you have different data types, e.g.

0 2.8 "hello"

will be imported in OrientDB in schema-less mode, with no errors, thus giving the users lot of flexibility

In your specific case, you will have some integers and some decimal values (and no cast will happen), e.g. you fill find in the database the following values:

0 2.82942

Note that after the ETL is complete, you can still, if you want, create a DECIMAL property on this column

I believe that with this additional case we are considering to handle we will give more flexibility to our users. Obviously, one may still decide to define the data type before the import, and a cast will be done

Warning: when using schema-less with no checks on the data type, users must know what they are doing (as they may end up with having strings, integers, decimals and other types on the same property. Sometime they may want right this - and some other times lack of schema will prevent the user to find that that "string" is a wrong value. I believe this is kind of compromise we have to accept to have maximum flexibility)

Will it be a good compromise for you?

Many thanks,

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/orientechnologies/orientdb/issues/6839#issuecomment-261907867, or mute the thread https://github.com/notifications/unsubscribe-auth/ASkWeOmg1S8XzVK9Yxu6NZKAcrkwAb-aks5rAXrTgaJpZM4KfOoq .

Tore Austrått phone +47 90657231