sadikovi / spark-netflow

NetFlow data source for Spark SQL and DataFrames
Apache License 2.0

Ignore corrupt files 2 #59

Closed sadikovi closed 7 years ago

sadikovi commented 7 years ago

This PR takes a more general approach to introducing ignoreCorruptFiles. It updates NetFlowFileRDD to respect the Spark option spark.files.ignoreCorruptFiles. When this option is true, files that are corrupt or are not NetFlow files are ignored. If a file is partially corrupt, only the recoverable data is read (up to the corrupted block); if the reader fails to initialize, an empty iterator is returned for that file.
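
For illustration only, the per-file policy described above could be sketched roughly as follows. This is not the code in this PR: `CorruptFilePolicy` and `openOrEmpty` are hypothetical names, and the `Callable` stands in for whatever actually initializes the reader for one file.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of spark-netflow: shows the per-file policy only.
final class CorruptFilePolicy {
    private CorruptFilePolicy() {}

    /**
     * Runs `open` to initialize a reader for one file and obtain its records.
     * If initialization fails and ignoreCorruptFiles is true, an empty iterator
     * is returned so the file contributes no rows; otherwise the failure is rethrown.
     */
    static <T> Iterator<T> openOrEmpty(Callable<Iterator<T>> open, boolean ignoreCorruptFiles) {
        try {
            return open.call();
        } catch (Exception err) {
            if (ignoreCorruptFiles) {
                // Corrupt or non-NetFlow file: skip it instead of failing the task.
                return Collections.emptyIterator();
            }
            throw new RuntimeException("Failed to initialize reader", err);
        }
    }
}
```

With the flag left at false the original failure still surfaces, wrapped in a RuntimeException in this sketch, so the existing fail-fast behaviour is unchanged.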

This change is also added to netflowlib, so the reader can take the option ignoreCorruptFiles (default is false). In case of failure, the reader sets isValid() to false and returns CorruptNetFlowHeader, which is a no-op for most operations. When the flag is true, a SafeIterator is returned, which terminates on failure.
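
A minimal sketch of the "terminates on failure" idea, reading it as "stop iterating instead of propagating the error". The class name `SafeIteratorSketch` is hypothetical, and the real SafeIterator in netflowlib may well be implemented differently. The sketch assumes the underlying iterator reports corruption as an unchecked exception.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch, not the SafeIterator from netflowlib.
final class SafeIteratorSketch<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private boolean failed = false;      // set once the delegate has thrown
    private boolean hasBuffered = false; // true when `buffered` holds an unread record
    private T buffered;

    SafeIteratorSketch(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        if (hasBuffered) return true;
        if (failed) return false;
        try {
            if (!delegate.hasNext()) return false;
            // Read ahead here so that a failure surfaces in hasNext(), not in next().
            buffered = delegate.next();
            hasBuffered = true;
            return true;
        } catch (RuntimeException err) {
            // Corrupted block: keep what was already returned and end the iteration.
            failed = true;
            return false;
        }
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T value = buffered;
        buffered = null;
        hasBuffered = false;
        return value;
    }
}
```

Reading one record ahead inside hasNext() means a caller looping with hasNext()/next() sees every record before the corrupted block and then a normal end of iteration.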

sadikovi commented 7 years ago

Remove NetFlowCorruptSuite.scala.

codecov-io commented 7 years ago

Current coverage is 95.94% (diff: 94.44%)

Merging #59 into master will increase coverage by 0.02%

@@             master        #59   diff @@
==========================================
  Files            21         21          
  Lines           908        913     +5   
  Methods         770        773     +3   
  Messages          0          0          
  Branches        138        140     +2   
==========================================
+ Hits            871        876     +5   
  Misses           37         37          
  Partials          0          0          

Powered by Codecov. Last update 9d405a6...cb20e75