haskell-hvr / cassava

A CSV parsing and encoding library optimized for ease of use and high performance
http://hackage.haskell.org/package/cassava
BSD 3-Clause "New" or "Revised" License

Question on csv programming exercise and encoding rows missing specific header keys #176

Open CoreyWinkelmannPP opened 5 years ago

CoreyWinkelmannPP commented 5 years ago

Problem Description

I need to take a collection of CSV documents in a folder and merge them into one large CSV document. The files share some overlapping columns, and each file also has some unique columns. The script should read each of these files, merge them, and then write out the new CSV file.

Solution I have working (but it seems a little slow compared to a Go or Rust implementation)

Rust and Go implementations run this scenario on the data set in 100 to 200 ms. The Haskell version below takes 300 to 400 ms. A Python version also ran in that 300 to 400 ms range, which is why I think Haskell should be able to do this faster.

I have coded the following. Originally I was hoping to stream through the files and build up the results with conduit, but I ended up bailing on that and instead collecting the file contents and processing them one at a time. I would like a more efficient and idiomatic Haskell version of this and was wondering if anyone could give me some insight into what that might look like. One issue I ran into with the solution below is that I had to change the cassava code so that an empty string is returned when the map lookup returns Nothing, instead of failing as the current version does.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Conduit
import System.FilePath (takeExtension)
import Data.Csv
import qualified Data.Vector as V
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map as M
import Data.Either
import Data.List (nub)
import Control.Monad.IO.Class (liftIO)

-- One decoded row, keyed by column name.
type Column = M.Map BS.ByteString LBS.ByteString
type Rows = V.Vector Column
type CsvDocument = (Header, Rows)
-- Raw, undecoded contents of each .csv file found on disk.
type CsvDocuments = V.Vector BS.ByteString
type ErrorMsg = String

getCsvDocuments :: ConduitM a c (ResourceT IO) CsvDocuments
getCsvDocuments = sourceDirectoryDeep True "."
        .| filterC (\fp -> takeExtension fp == ".csv")
        -- Read each file in one piece so every vector element is a whole document;
        -- streaming with sourceFile would split larger files into separate chunks.
        .| mapMC (liftIO . BS.readFile)
        .| sinkVector

-- Concatenate two headers, dropping duplicate column names.
mergeHeader :: Header -> Header -> Header
mergeHeader h1 h2 = V.fromList . nub . V.toList $ (h1 V.++ h2)

-- Decode one raw file and fold it into the accumulated document.
-- Note that fromRight silently discards any file that fails to decode.
combineCsvDocuments :: CsvDocument -> BS.ByteString -> CsvDocument
combineCsvDocuments acc csv = (mergedHeader, mergedBody)
    where
        decodedCsv = fromRight (V.empty, V.empty) . decodeByName . LBS.fromStrict $ csv
        mergedHeader = mergeHeader (fst decodedCsv) (fst acc)
        mergedBody = snd acc V.++ snd decodedCsv

mapFiles :: CsvDocuments -> CsvDocument
mapFiles = V.foldl' combineCsvDocuments (V.empty, V.empty)

main :: IO ()
main = do
    files <- runConduitRes getCsvDocuments
    let document = mapFiles files
    LBS.writeFile
        "output/combined_response.csv"
        (encodeByName
            (fst document)
            (V.toList . snd $ document))
    return ()
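
As a side note on the fold above: fromRight (V.empty, V.empty) silently drops any file that fails to decode. Purely as an illustration (combineOrFail and the Either-based accumulator are not part of the original code), a variant that surfaces the first decode error could look roughly like this:

import Data.Csv (Header, decodeByName)
import Data.List (nub)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map as M
import qualified Data.Vector as V

-- Same row representation as in the example above.
type Column = M.Map BS.ByteString LBS.ByteString
type CsvDocument = (Header, V.Vector Column)

-- Fold one raw file into the accumulated document, keeping the first decode error.
combineOrFail :: Either String CsvDocument -> BS.ByteString -> Either String CsvDocument
combineOrFail acc csv = do
    (accHeader, accRows) <- acc
    (header, rows) <- decodeByName (LBS.fromStrict csv)
    let mergedHeader = V.fromList . nub . V.toList $ accHeader V.++ header
    pure (mergedHeader, accRows V.++ rows)

mapFiles would then become V.foldl' combineOrFail (Right (V.empty, V.empty)), and main could report the Left case instead of writing out a partial result.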

About me

I am learning Haskell but am still a beginner. I pick out challenges and try them in Haskell, and I am always looking for feedback and better approaches from more experienced people. In my current role I develop in object-oriented languages, which I understand well, and I am trying to expand my knowledge by gaining a better understanding of how functional programming can improve my development skills.

Thanks in advance for any help you can give!

CoreyWinkelmannPP commented 5 years ago

When implementing a solution for this I had to compile a local copy of the library to bypass the hard-coded error path taken when the Map lookup fails; at least, that is what got it working. Below is the function that does the lookup and fails on Nothing. To get the example above to work I changed the Nothing case to return an empty string instead, which lets it build any number of columns whether or not every row has all of the expected data. I am still wondering what approach the people here would take to this problem without having to change the library's internals. Thanks again for any insights you all have!

namedRecordToRecord :: Header -> NamedRecord -> Record
namedRecordToRecord hdr nr = V.map find hdr
  where
    find n = case HM.lookup n nr of
        Nothing -> moduleError "namedRecordToRecord" $
                   "header contains name " ++ show (B8.unpack n) ++
                   " which is not present in the named record"
        Just v  -> v
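
For what it's worth, one way to get the same effect without patching the library is to pad each decoded row so that every name in the merged header is present before calling encodeByName. A minimal sketch along those lines, reusing the Column and Rows aliases from the example above (padRow and padRows are hypothetical helpers, not part of cassava):

import Data.Csv (Header)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map as M
import qualified Data.Vector as V

-- Same row representation as in the example above.
type Column = M.Map BS.ByteString LBS.ByteString
type Rows = V.Vector Column

-- Ensure every header name has an entry in the row, defaulting any missing
-- column to the empty field; existing values are left untouched.
padRow :: Header -> Column -> Column
padRow hdr row = V.foldl' insertMissing row hdr
  where
    insertMissing acc name = M.insertWith (\_new old -> old) name LBS.empty acc

-- Pad every row of the merged document against the merged header.
padRows :: Header -> Rows -> Rows
padRows hdr = V.map (padRow hdr)

With that in place, main could call encodeByName with padRows applied to the merged body, and the lookup in namedRecordToRecord would always find a value, so the library would not need to change.
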
jkarni commented 2 years ago

I've come across the same problem. Encoding different CSVs and then merging them is expensive. Having the ability to decide (presumably in EncodeOptions) how to encode fields for header names that are missing from the data seems like an overall nicer experience. If such a change would be welcome, I could submit a PR.
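
Purely as an illustration of the kind of option being described here (these names do not exist in cassava; this is a hypothetical sketch, not a proposed final API), the policy could look roughly like this:

import Data.ByteString (ByteString)

-- Hypothetical: how a caller might choose what happens when a header name
-- is absent from a named record during encoding.
data MissingFieldPolicy
    = ErrorOnMissingField               -- current behaviour: fail the encode
    | DefaultMissingFieldTo ByteString  -- substitute a caller-supplied value, e.g. ""

-- namedRecordToRecord would then consult the policy instead of always erroring:
--
--   find n = case HM.lookup n nr of
--       Just v  -> v
--       Nothing -> case missingFieldPolicy opts of
--           DefaultMissingFieldTo def -> def
--           ErrorOnMissingField       -> moduleError "namedRecordToRecord" ...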