twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

TypedTsv overwrites existing folder on HDFS #1893

Open vsacheti opened 5 years ago

vsacheti commented 5 years ago

We recently found that some of our folders had been accidentally deleted. On further investigation, I found that TypedTsv overwrites the output folder if it already exists. I used the following simple script to test it.

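val inputPath = args("input")
val outputPath = args("output")
// Each input line is a String, so `p + 10` appends the text "10" to it.
val output = TypedPipe.from(TextLine(inputPath))
  .map(p => p + 10)
// This write is the step that replaced the existing output folder.
output.write(TypedTsv(outputPath))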

I ran the above script with arguments such as --hdfs --input in1 --output out1, where in1 and out1 are folders on HDFS.

Looking into this further, I found from the following link that the default behavior should be KEEP: http://docs.cascading.org/cascading/1.2/userguide/html/ch03s03.html

Here is the excerpt from the above link

SinkMode.KEEP
This is the default behavior. If the resource exists, attempting to write to it will fail.

I am a little confused, as I don't see SinkMode as a parameter to the TypedTsv constructor, so maybe I am mixing up two unrelated things.

But fundamentally things should not get deleted.
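
Until the sink behavior is understood, the expectation above (an existing folder should never be silently removed) could be enforced on the caller's side. The following is only a minimal sketch under stated assumptions: OutputGuard, requireAbsent and localExists are hypothetical names that are not part of Scalding or Cascading, and the existence check uses java.nio against the local filesystem purely for illustration. On HDFS the check would need to go through Hadoop's FileSystem API instead.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper, not part of Scalding or Cascading.
// It mimics the documented SinkMode.KEEP contract: fail if the target exists.
object OutputGuard {

  // Throws if `exists` reports that `path` is already present.
  // The existence check is passed in so it can be swapped for an HDFS-aware one.
  def requireAbsent(path: String, exists: String => Boolean): Unit =
    if (exists(path))
      throw new IllegalStateException(
        s"Refusing to overwrite existing output: $path")

  // Local-filesystem check, for illustration only (assumption: not HDFS).
  def localExists(path: String): Boolean =
    Files.exists(Paths.get(path))
}
```

In the test script, a call such as OutputGuard.requireAbsent(outputPath, OutputGuard.localExists) placed before output.write(TypedTsv(outputPath)) would abort the run instead of letting the write replace the folder, provided the supplied check can actually see the target filesystem.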

I would really appreciate it if somebody could look at this and explain the above behavior.

Thanks