gisaia / ARLAS-proc

Workaround about data ingestion with computing frameworks
Apache License 2.0

Simplify DataModel objects #116

Closed sfalquier closed 4 years ago

sfalquier commented 4 years ago

DataModel may only contain:

All other existing fields must be removed and passed as argument(s) to the methods/transformers that need them.

laurent-thiebaud-gisaia commented 4 years ago

We should remove dynamicColumns from the DataModel. To keep checking that some columns are Double (as was done in DataFrameFormatter.withValidDynamicColumnsType), I propose adding a second parameter to ArlasTransformer:

abstract class ArlasTransformer(val requiredCols: Vector[String] = Vector.empty,
                                val doubleCols: Vector[String] = Vector.empty)

The checkSchema() method would be responsible for checking that these columns are indeed Double. This means the customer application is now in charge of providing the correct format (e.g. replacing "," with "." in some string columns so they can be converted to Double), and each transformer checks that its Double columns are as expected.
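To illustrate the customer-side conversion mentioned above, here is a minimal sketch using standard Spark SQL functions. The object `DecimalCommaFix` and the method `toDoubleColumns` are hypothetical names chosen for this example; they are not part of ARLAS-proc.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper for the customer application (not an ARLAS-proc API).
object DecimalCommaFix {

  // For each listed string column, rewrite "12,5" as "12.5" and cast the
  // result to Double. Under non-ANSI cast semantics, values that still
  // cannot be parsed become null instead of failing the job.
  def toDoubleColumns(df: DataFrame, cols: Seq[String]): DataFrame =
    cols.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, regexp_replace(col(name), ",", ".").cast(DoubleType))
    }
}
```

A transformer receiving the resulting DataFrame can then assume these columns are Double and only needs to verify that assumption against the schema.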

Thus DataFrameFormatter should check that the DataModel columns are correctly typed (lat/lon are Double, timestamp is Long).

@sfalquier are you OK with it?

sfalquier commented 4 years ago

Since doubleCols would be used by only a few transformers, there is no need to have it in the parent class. If this check needs to be factored out, provide a utility method in the io.arlas.data.utils package.
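Such a utility method could look like the sketch below. It relies only on Spark's `StructType` API; the object name `SchemaChecks` and both method names are assumptions made for illustration, not existing ARLAS-proc code.

```scala
import org.apache.spark.sql.types.{DataType, DoubleType, StringType, StructField, StructType}

// Hypothetical utility (names are illustrative, not the project's actual API).
object SchemaChecks {

  // Returns the names in `cols` that are either missing from `schema`
  // or present with a type other than `expected`.
  def columnsNotOfType(schema: StructType, cols: Seq[String], expected: DataType): Seq[String] =
    cols.filterNot(c => schema.fields.exists(f => f.name == c && f.dataType == expected))

  // Fails with IllegalArgumentException if any listed column is missing or not Double.
  def requireDoubleColumns(schema: StructType, cols: Seq[String]): Unit = {
    val invalid = columnsNotOfType(schema, cols, DoubleType)
    require(invalid.isEmpty, s"Columns missing or not of type Double: ${invalid.mkString(", ")}")
  }
}

object SchemaChecksExample extends App {
  val schema = StructType(Seq(
    StructField("speed", DoubleType),
    StructField("name", StringType)
  ))

  // "speed" passes; "name" has the wrong type; "course" is absent.
  val invalid = SchemaChecks.columnsNotOfType(schema, Seq("speed", "name", "course"), DoubleType)
  println(invalid.mkString(", ")) // prints "name, course"
}
```

Only the transformers that actually depend on Double columns would call `requireDoubleColumns` from their own schema check, so the parent class stays unchanged. Because the expected type is a parameter of `columnsNotOfType`, the same helper could presumably be reused with other types as well.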

laurent-thiebaud-gisaia commented 4 years ago

OK. What about validating that the DataModel columns are of the expected type (as we used to do by checking that lat/lon were Double)? Or do we assume that the customer's application has already done it?