mbraceproject / MBrace.Core

MBrace Core Libraries & Runtime Foundations
http://mbrace.io/
Apache License 2.0

CloudFlow.skip #68

Closed isaacabraham closed 9 years ago

isaacabraham commented 9 years ago

If possible, a skip() function should be added to the CloudFlow module.

palladin commented 9 years ago

Unfortunately, skip is very difficult to implement, and I think that is the reason Spark doesn't support it. If you search for "How do we skip the header?" at http://researchcomputing.github.io/meetup_spring_2014/python/spark.html you will find that they use filter instead.
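
For illustration, here is a minimal sketch of the filter workaround in plain Python, not Spark or MBrace code. The list-of-lists partition layout and the name `skip_header_by_content` are hypothetical, and it assumes the first partition is non-empty and begins with the header row.

```python
from typing import List

def skip_header_by_content(partitions: List[List[str]]) -> List[List[str]]:
    """Drop every row equal to the header, independently in each partition."""
    # Assumption: the header is the first row of the first partition.
    header = partitions[0][0]
    # Each partition filters on content only, so no cross-partition
    # coordination (and no row index) is needed.
    return [[row for row in part if row != header] for part in partitions]

# Hypothetical data: a CSV file split into two partitions.
partitions = [["id,name", "1,ann"], ["2,bob", "3,cat"]]
result = skip_header_by_content(partitions)
```

Note that a data row that happens to be identical to the header would be dropped as well, which is the price of comparing content instead of position.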

isaacabraham commented 9 years ago

Yes, I had the same thought: it is difficult to implement because it could happen at arbitrary stages of the distributed pipeline. But yes, it was exactly for the "skip header" case :-) I have indeed used a filter. Without a mapi, the filter has to compare the content rather than, e.g., the row index, but again I imagine that mapi is not easy to implement either.
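
To make the difficulty concrete, here is a rough sketch in plain Python, not MBrace code, of what an index-based skip would need over partitioned data. The function name `skip` and the list-of-lists layout are assumptions for illustration. The point is that no partition knows its global row offset until every earlier partition has been counted, so an extra counting pass is required before any rows can be dropped.

```python
from itertools import accumulate
from typing import List

def skip(partitions: List[List[str]], n: int) -> List[List[str]]:
    """Drop the first n rows in global order across all partitions."""
    # Pass 1: count the rows in each partition.
    sizes = [len(part) for part in partitions]
    # Global index of the first row of each partition (exclusive prefix sum).
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: each partition drops the rows whose global index is below n.
    return [part[max(0, n - off):] for part, off in zip(partitions, offsets)]

# Hypothetical data: five rows spread over three partitions.
partitions = [["h", "a"], ["b", "c"], ["d"]]
without_header = skip(partitions, 1)
without_three = skip(partitions, 3)
```

The same per-partition offsets are what an index-aware map such as mapi would need, which is why both operations are harder than a filter in a distributed pipeline.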