sadikovi / spark-netflow

NetFlow data source for Spark SQL and DataFrames
Apache License 2.0

Spark 2.x support #61

Closed by sadikovi 7 years ago

sadikovi commented 7 years ago

This PR adds support for Spark 2.x (specifically any 2.0.x and 2.1.x release). The data source is implemented as a subclass of FileFormat, without write support. The build file is updated to test all target Spark 2.x versions. This work will close some of the open issues, e.g. using InternalRow instead of Row and refactoring RDD methods.
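For readers unfamiliar with the Spark 2.x API, a read-only FileFormat subclass can be sketched roughly as follows. This is a minimal illustration and not the code added by this PR: the class name `ExampleReadOnlyFormat` and the single `length` column are hypothetical, and the reader returns each file's length in bytes instead of parsing NetFlow records. It assumes Spark 2.0.x or 2.1.x on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory, PartitionedFile}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical class for illustration only; not the class from this PR.
class ExampleReadOnlyFormat extends FileFormat {

  // Fixed schema with one hypothetical column; a real source would
  // derive the schema from options or from the files themselves.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    Some(StructType(StructField("length", LongType, nullable = false) :: Nil))

  // No write support: fail as soon as a write is attempted.
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory =
    throw new UnsupportedOperationException("Write is not supported")

  // Returns a function that turns one file into an iterator of InternalRow.
  // Filters are ignored here; Spark re-applies them after the scan.
  override def buildReader(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
    // Only one column exists, so every requested field is "length".
    val numFields = requiredSchema.length
    (file: PartitionedFile) =>
      Iterator(InternalRow.fromSeq(Seq.fill[Any](numFields)(file.length)))
  }
}
```

Under these assumptions the class could be loaded with `spark.read.format(classOf[ExampleReadOnlyFormat].getName).load(path)`. Because Spark plans the file splits itself for a FileFormat, the data source only supplies the per-file reader function.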

Since the datasource API has changed significantly, some of the affected files have been removed, such as NetFlowRDD and NetFlowFileStatus. We also no longer have control over partitioning, so this feature is removed too. Statistics on columns are removed as well (except header information about the time range).

Options that are left:

See updated README for more information.

codecov-io commented 7 years ago

Codecov Report

Merging #61 into master will decrease coverage by 2.41%. The diff coverage is 99.22%.

@@            Coverage Diff             @@
##           master      #61      +/-   ##
==========================================
- Coverage   95.94%   93.54%   -2.41%     
==========================================
  Files          21       12       -9     
  Lines         913      418     -495     
  Branches      140       32     -108     
==========================================
- Hits          876      391     -485     
+ Misses         37       27      -10


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ce2764e...6fa47b8.

sadikovi commented 7 years ago

Closes #56, #60, #47.