modelop / hadrian

Implementations of the Portable Format for Analytics (PFA)
Apache License 2.0

Batch processing with PFA #50

Closed karims closed 6 years ago

karims commented 6 years ago

I wasn't sure where to post this question, so I am asking it here. If there is a better forum, let me know; I could not join Slack.

I know PFA modelling is tied to an individual datum. Is there a way to model on batch data? For example, one of my use cases is loading a CSV into a Spark data frame and doing a sort or groupBy. Is such an operation possible here?

jpivarski commented 6 years ago

You're looking for a reducer (groupBy is a particular reducer that also fills a map). The third of PFA's three method options is a reducer mode. It still gives a score for each datum, one by one, but it does so in a way that accumulates (search for references to tally). It also forces you to write a combine function that merges partial tallies, in case you're running Hadrian independently on many batches and need to combine the partial results.
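
As a rough illustration of the reducer idea described above, here is a minimal sketch in plain Python. It is not PFA syntax and does not use the Hadrian API; the names `zero`, `action`, `merge`, and `run_batch`, as well as the `key`/`value` record layout, are hypothetical and chosen only for this example. It shows a groupBy-style sum in which each datum updates a running tally, and a combine step merges the partial tallies from independently processed batches.

```python
from functools import reduce


def zero():
    """Starting tally for a batch: an empty map from group key to running sum."""
    return {}


def action(tally, datum):
    """Fold one datum into the tally and return the updated tally.

    The returned tally is also the 'score' produced for this datum.
    """
    updated = dict(tally)
    updated[datum["key"]] = updated.get(datum["key"], 0.0) + datum["value"]
    return updated


def merge(tally_one, tally_two):
    """Combine two partial tallies that were built on separate batches."""
    merged = dict(tally_one)
    for key, total in tally_two.items():
        merged[key] = merged.get(key, 0.0) + total
    return merged


def run_batch(batch):
    """Process a batch one datum at a time, accumulating into a single tally."""
    return reduce(action, batch, zero())


# Hypothetical data: two batches processed independently, then combined.
batch_a = [{"key": "x", "value": 1.0}, {"key": "y", "value": 2.0}]
batch_b = [{"key": "x", "value": 3.0}]

combined = merge(run_batch(batch_a), run_batch(batch_b))
single_pass = run_batch(batch_a + batch_b)
```

In this sketch, merging the per-batch tallies gives the same grouped sums as a single pass over all the data, which is the property the combine function needs to have when batches are scored separately.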