scrapinghub / shublang

Pluggable DSL that uses pipes to perform a series of linear transformations to extract data
BSD 3-Clause "New" or "Revised" License

New parameter data_type in the evaluate() method #32

Open akshayphilar opened 4 years ago

akshayphilar commented 4 years ago

Currently, pipeline expressions must be terminated with _aslist or first, depending on the data type required at the output. Since the required data type is known beforehand, it could instead be passed to the evaluate function, which would make the final transformation implicit and the expressions terser.

@peonone @BurnzZ does this make sense?

BurnzZ commented 4 years ago

@akshayphilar I'm all for terser expressions. I'm not sure, though, what you meant by this part:

... if this were passed to the evaluate function, thus making the final transformation implicit.

Could you provide some examples of this? For instance, if we have the double first expression as documented in https://github.com/scrapinghub/shublang/pull/28, how would it change?

akshayphilar commented 4 years ago

Notwithstanding the current quirks in the re_search function, which will be fixed separately as part of #30, the current expression which looks like this

r"sub(r',', '') | re_search(r'#(\d+)') | filter(lambda x: x) | first | int | first"

could have all 3 terminating pipe functions | first | int | first peeled away based on certain conditions.
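
To make the idea concrete, here is a rough, self-contained sketch of what an implicit final transformation might look like. It does not use shublang itself: the `evaluate` signature, the `data_type` handling, and the helpers `strip_commas` and `find_issue_numbers` are all hypothetical stand-ins, and the rules for when the trailing steps can be dropped are an assumption, not a settled design.

```python
import re
from typing import Any, Callable, List, Optional

# Hypothetical stand-ins for pipe functions; NOT shublang's real implementations.
def strip_commas(values: List[str]) -> List[str]:
    """Remove commas from every string, similar in spirit to sub(r',', '')."""
    return [v.replace(",", "") for v in values]

def find_issue_numbers(values: List[str]) -> List[str]:
    """Collect the digits captured by r'#(\\d+)' from every string."""
    return [m for v in values for m in re.findall(r"#(\d+)", v)]

def evaluate(steps: List[Callable[[Any], Any]],
             data: Any,
             data_type: Optional[type] = None) -> Any:
    """Run the steps in order, then apply an implicit final transformation.

    Assumed behaviour for this sketch only:
      - data_type=None: return the raw result (the caller terminates explicitly)
      - data_type=list: return the result as a list
      - any other type: take the first item, then coerce it to that type
    """
    result = data
    for step in steps:
        result = step(result)

    if data_type is None:
        return result
    if data_type is list:
        return list(result)
    if isinstance(result, (list, tuple)):
        if not result:
            return None  # nothing matched, so there is nothing to coerce
        result = result[0]
    return data_type(result)

steps = [strip_commas, find_issue_numbers]
as_int = evaluate(steps, ["Issue #1,234 and #56"], data_type=int)
as_list = evaluate(steps, ["Issue #1,234 and #56"], data_type=list)
empty = evaluate(steps, ["no issue here"], data_type=int)
```

Under these assumptions the trailing first and int steps no longer appear in the pipeline. The trade-off is that the reader has to know the `data_type` rules to predict the output, including what happens when nothing matches.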

We will need to evaluate how this will affect the overall robustness of the pipeline as well as its comprehensibility. I will send a PR so that we can discuss the ramifications of implicit vs explicit transformations.

cc @VMRuiz