Closed by isaacabraham 9 years ago
Unfortunately, skip is very difficult to implement, and I think that is why Spark doesn't support it either. If you search for "How do we skip the header?" on http://researchcomputing.github.io/meetup_spring_2014/python/spark.html, you will see that they use filter instead.
Yes, I had the same thought: it is difficult to implement because it could occur at arbitrary stages of the distributed pipeline. And yes, it was exactly for the "skip header" case :-) I have indeed used a filter; without a mapi it has to compare the content rather than e.g. a row index, but I imagine mapi is not easy to implement either.
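For reference, here is a minimal sketch of the filter-based workaround. It assumes MBrace.Flow's CloudFlow.OfCloudFileByLine, CloudFlow.filter, CloudFlow.map and CloudFlow.toArray; the CSV path and header line are hypothetical placeholders:

```fsharp
open MBrace.Core
open MBrace.Flow

// Hypothetical file path and header line, for illustration only.
let path = "/data/people.csv"
let header = "Name,Age,City"

let rows =
    CloudFlow.OfCloudFileByLine path
    // Skip the header by comparing content, since there is no mapi / row index.
    |> CloudFlow.filter (fun line -> line <> header)
    |> CloudFlow.map (fun line -> line.Split(','))
    |> CloudFlow.toArray
```

The result is a Cloud computation that would then be sent to the runtime as usual; the point is just that the header is identified by its content rather than by its position.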
If possible, a skip() function should be added to the CloudFlow module.
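Purely to illustrate the requested semantics, a rough sketch of what such a helper would have to do today is shown below. skipNaive is a hypothetical name, not part of MBrace.Flow, and it forces full evaluation of the flow, which is exactly what a proper CloudFlow.skip would need to avoid:

```fsharp
open MBrace.Core
open MBrace.Flow

// Hypothetical workaround, not part of MBrace.Flow: materialise the flow,
// drop the first `count` elements, then rebuild a flow from the remainder.
// This discards the streaming/partitioned evaluation that CloudFlow provides,
// which is why a built-in skip is non-trivial to implement.
let skipNaive (count : int) (flow : CloudFlow<'T>) : Cloud<CloudFlow<'T>> =
    cloud {
        let! xs = CloudFlow.toArray flow        // forces the whole flow
        return CloudFlow.OfArray (Array.skip count xs)
    }
```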