sadikovi / spark-netflow

NetFlow data source for Spark SQL and DataFrames
Apache License 2.0
18 stars 11 forks

Exception when inferring version from corrupt files with ignoreCorruptFiles=true #62

Closed. sadikovi closed this issue 7 years ago

sadikovi commented 7 years ago

When a glob path matches corrupt files, schema inference fails if the first selected file is not a NetFlow file.

Currently ignoreCorruptFiles is not applied when inferring the version from files, so inference fails with the exception below if the selected file is not a NetFlow file.

17/02/21 16:20:51 INFO DAGScheduler: Job 9 finished: load at <console>:23, took 0.285259 s
java.io.IOException: Corrupt NetFlow file. Wrong magic number
  at com.github.sadikovi.netflowlib.NetFlowReader.<init>(NetFlowReader.java:137)
  at com.github.sadikovi.netflowlib.NetFlowReader.prepareReader(NetFlowReader.java:80)

Note that this works correctly when all files are valid, when the file used to infer the version is a NetFlow file, or when the version is provided explicitly.

sadikovi commented 7 years ago

I think we should just throw a proper exception saying that the version cannot be inferred, and that it should either be specified manually or the files should be checked for validity.
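
A rough sketch of what that could look like, not the actual fix: catch the low-level IOException during inference and rethrow it with a message that tells the user what to do. `VersionInference`, `VersionReader` and `inferVersion` are hypothetical names used only for illustration; they are not part of netflowlib or spark-netflow, and the exception type is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class VersionInference {
    /** Hypothetical stand-in for header parsing; not the real NetFlowReader API. */
    interface VersionReader {
        int readVersion(InputStream in) throws IOException;
    }

    /**
     * Tries to read the NetFlow version from a single file. On failure, wraps the
     * low-level error in an exception that explains how to work around it.
     */
    static int inferVersion(String path, InputStream in, VersionReader reader) {
        try {
            return reader.readVersion(in);
        } catch (IOException err) {
            // Keep the original error as the cause so the stack trace is not lost.
            throw new IllegalArgumentException(
                "Failed to infer NetFlow version from file " + path + ". "
                + "Specify the version manually, or check that the files are valid NetFlow files",
                err);
        }
    }

    public static void main(String[] args) {
        // Simulates a file that is not a NetFlow file (assumed failure mode).
        VersionReader failing = in -> {
            throw new IOException("Corrupt NetFlow file. Wrong magic number");
        };
        try {
            inferVersion("/tmp/not-netflow.txt", new ByteArrayInputStream(new byte[] {1, 2}), failing);
        } catch (IllegalArgumentException err) {
            System.out.println(err.getMessage());
        }
    }
}
```

With this shape the original "Wrong magic number" error is still available as the cause, while the top-level message points at the two ways out: specify the version, or fix the input files.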