ozataman / csv-conduit

Flexible, fast and constant-space CSV library for Haskell using conduits
Other
52 stars 32 forks source link

Named CSV instance eats errors #24

Open MichaelXavier opened 7 years ago

MichaelXavier commented 7 years ago

I notice that the instance for CSV eats errors on parse. It first parses rows into NamedRecord and then parses to an Either String a, discards the error message and skips the row. I found this surprising as default behavior and it does not appear to be documented. In my case, If you parse a CSV file with some invalid rows, it simply skips them and returns a successful result with the rows that worked. If you have a missing column, it parses successfully and returns that there were no rows. The MonadThrow requirement for IntoCSV seems to be unused.

I could see a few ways forward:

  1. This is undesirable default behavior and the Named instance will throw a new exception that wraps the String error message. We could create a newtype like Named called NamedSkipErrors or something more pithy that uses this current implementation of eating errors.
  2. This is valid default behavior and we don't want to break existing installs, so we create a NamedThrowErrors or something more pithy that throws on invalid rows.
  3. This could be added to either of the above as newtype NamedOrError a = NamedOrError (Either String a) which also never throws but captures per-row errors and lifts them into the types so they can be dealt with.

Because these are all solved by newtypes, this is something I can do myself in userland code without changing the library in the mean time.

MichaelXavier commented 7 years ago

One thing I noticed that may throw a wrench in this plan is unfortunately the CSV typeclass requires both to/from instances, unlike say aeson which splits the instances. This means option 3 is not a great idea for the library because it would have to write a csv instance for an error, which isn't really sensible.

bitemyapp commented 7 years ago

@MichaelXavier I would definitely prefer it always preserve the error information. I'd like the option of the following:

I think the default behavior is bad enough that it is worth breaking. What do you think?

bitemyapp commented 7 years ago

Secondary question: I have a potential workaround for my current use-case, but I would need to be able to parse the rows as dumb, un-named columns. It's not obvious to me how to do that from csv-conduit's source code. What am I meant to do there? Am I meant to deal with Row ByteString manually?

MichaelXavier commented 7 years ago

I'll have to think about the suggestion later but for parsing unnamed columns you probably want the CSV instance for CSV ByteString (Row ByteString) or Row Text. Also because of CSV s (Row s) => CSV s (Vector s), I think you should also be able to get Vector ByteString too.

bitemyapp commented 7 years ago

@MichaelXavier okie, I think the other thing is I'll need to figure out how to skip the header. Is there a built in means of doing so such as exists in cassava?

ozataman commented 7 years ago

Just chiming in on the last question - you can do that simply by using the monadic nature of conduit - consume/discard one row, continue with the rest of your consumer in the same monadic block. i.e. it's just C.drop 1 from Conduit.List

ozataman commented 7 years ago

I may have to think a little more on the issue, but the main reason why that happened is because of the streaming nature of the library. It's meant for those cases where you're working with large amounts of data and you don't expect to be able to load it all up in one go. If conduit had a structure where it allowed to produce 2 concurrent streams of stuff, somehow, then we'd definitely be emitting one error stream and one parsed stream. On a separate note, maybe that's what the machines lib would allow?

Anyway, what we definitely don't want is any solution that would:

So the "rows that succeeded" sort of solution would never work, as that defeats the streaming nature and constant space guarantees.

That said, I definitely think this was incomplete (maybe even possibly flat out bad, though a good number of practical data science real world use cases will still prefer the current behavior) design from way back and we should improve it somehow by giving visibility into errors. At a glance, option #3 seems most natural to me right now. We would add something like that as an additional instance, basically.

I also really like @bitemyapp's idea for additional metadata around the error - i.e. the row number. The issue with that is we'd need to introduce bookkeeping machinery that would slow down every single case out there even when user doesn't care. I wonder if there's a way to make that optional as well... Maybe we do it for some instances only?

bitemyapp commented 7 years ago

@ozataman There's a lot to reply to here, thank you! I'll do quoted replies for context if no-one minds:

consume/discard one row, continue with the rest of your consumer in the same monadic block. i.e. it's just C.drop 1 from Conduit.List

Conduit sources of vanilla ByteStrings only drop line by line rather than mysterious chunking behavior if I explicitly use a combinator that splits on line. I'm guessing you mean after I yank a Conduit out of fromCSV.

It's meant for those cases where you're working with large amounts of data and you don't expect to be able to load it all up in one go

That's precisely my situation. I am parsing 1gb++ CSV reports from Amazon. Here's the thing though, if I get a row number in my error log, I can stream/seek to the row number for debugging on a reacquisition of the CSV report. Usually I can debug/deal with CSV rows as individual datum but I'm not always that lucky and need to "look around" nearby data for context.

If conduit had a structure where it allowed to produce 2 concurrent streams of stuff, somehow, then we'd definitely be emitting one error stream and one parsed stream

I must be missing something, is there a reason it can't be a single stream of alternating constructors? We're talking about a case that explicitly chooses not to short-circuit so I don't know why you'd mind them being in a single stream.

Anyway, what we definitely don't want is any solution that would:

I concur with these constraints. It's just extremely unfortunate that the current behavior ghosts the failed named rows because we plan to clean up our handling of long tail data through error logging.

So the "rows that succeeded" sort of solution would never work, as that defeats the streaming nature and constant space guarantees.

cf. alternating constructors --- it's up to the consumer of the Conduit to decide how to deal, because frankly, how to deal with this is truly application specific. I might want to log them. I might want to save them to a db table. I just want a stream of successes and failures (sometimes!)

At a glance, option #3 seems most natural to me right now. We would add something like that as an additional instance, basically.

I think #3 fits my non-short-circuiting model I had in my head, just with a slightly awkward representation. I would probably blend it with #1, push out original, bad-default-behavior to NamedSkipErrors, make named short-circuit properly.

bitemyapp commented 7 years ago

@ozataman Are you sure that,

i.e. it's just C.drop 1 from Conduit.List

Works? Here's what I've tried so far:


testTake :: IO ()
testTake = do
  let incsv :: CSVSettings -> Conduit ByteString IO (Row ByteString)
      incsv = intoCSV
  runConduit $
       sourceLbs (fromStrict testDataActiveListingEng27)
    .| incsv (CSVSettings '\t' Nothing)
    .| mapM_C (pPrint :: (Row ByteString) -> IO ())
Prelude> testTake
(...prints header and data row...) 
testTake :: IO ()
testTake = do
  let incsv :: CSVSettings -> Conduit ByteString IO (Row ByteString)
      incsv = intoCSV
  runConduit $
       sourceLbs (fromStrict testDataActiveListingEng27)
    .| incsv (CSVSettings '\t' Nothing)
    .| dropC 1
    .| mapM_C (pPrint :: (Row ByteString) -> IO ())

prints nothing,

Prelude> testTake
Prelude> 
testTake :: IO ()
testTake = do
  let incsv :: CSVSettings -> Conduit ByteString IO (Row ByteString)
      incsv = intoCSV
  runConduit $
       sourceLbs (fromStrict testDataActiveListingEng27)
    .| dropC 1
    .| incsv (CSVSettings '\t' Nothing)
    .| mapM_C (pPrint :: (Row ByteString) -> IO ())

prints nothing,

Prelude> testTake
Prelude> 

Edit:

ditto Vector in place of []

bitemyapp commented 7 years ago

If I attempt to make the ByteString line-separated before the CSV parser has a whack at it, it breaks the parser:

testTake :: IO ()
testTake = do
  let incsv :: CSVSettings -> Conduit ByteString IO R
      incsv = intoCSV
  runConduit $
       sourceLbs (fromStrict testDataActiveListingEng27)
    .| linesUnboundedAsciiC
    .| incsv (CSVSettings '\t' Nothing)
    .| mapM_C (pPrint :: R -> IO ())

Results in one big row (header + data row) being returned because the separator got dropped.

bitemyapp commented 7 years ago

FWIW, cassava-conduit does the right thing here and has functions which skip the header automatically.

type R = [ByteString]

decodeOpts = C.defaultDecodeOptions {
      C.decDelimiter = fromIntegral (ord '\t')
    }

testTake :: IO ()
testTake = do
  let incsv :: CSVSettings -> Conduit ByteString IO R
      incsv = intoCSV
  res <- runEitherT $ bimapEitherT show id $ runConduit $
              sourceLbs (fromStrict testDataActiveListingEng27)
           .| C.fromCsv decodeOpts C.HasHeader
           .| mapM_C (liftIO . (pPrint :: R -> IO ()))
  print res

Only prints the data row for me.

ozataman commented 7 years ago

@bitemyapp Not having the code in front of me, I can only guess that C.drop is sink-like so it never actually produces anything. That's why my guess was to use it inside of the monadic block. I.e. something like:


= do
  C.drop 1
  restOfMyLogic

... as opposed to using fuse operators to add another step.

Not sure if that'll work, but you can ultimately even use await directly to discard. Your comments were correct - you need to drop after CSV parsing takes place so you drop the parsed row and not chunks of ByteString, which of course doesn't make sense as you've also noted.

bitemyapp commented 7 years ago

Ah! That did it!

testTake :: IO ()
testTake = do
  let incsv :: CSVSettings -> Conduit ByteString IO R
      incsv = intoCSV
  runConduit $
    let d :: Monad m => ConduitM a o m ()
        d = dropC 1
        s :: ConduitM a R IO ()
        s = sourceLbs (fromStrict testDataActiveListingEng27)
            .| incsv (CSVSettings '\t' Nothing)
        m = (mapM_C (pPrint :: R -> IO ()))
    in s .| (d >> m)

Returns only the data row. This was supremely helpful, thank you @ozataman!

MichaelXavier commented 7 years ago

Sounds like you both like option 3 (and either-wrapping newtype which doesn't throw exceptions). This seems nice to me too but as I mentioned, the library debateably makes the same mistake as cereal: it forces a to/from instance at minimum. So I could easily write intoCSV :: CSVSettings -> Conduit s m (Either CSVError r) and piggyback on existing instances for r to be Named a, Row s etc. However, now I need to write fromCSV :: CSVSettings -> Conduit (Either CSVError r) m s. That means I must confront the case where the stream gives me a Left. What do you guys think of the semantics being such that the fromCSV implementation for this just drops Left? I can see no other sensible alternative. CSVs by design tend to model 1 type, so any effort to serialize the error into a row will certainly break the output.

I guess the real issue at hand here is we're trying to use a data structure designed for one side of the process, parsing, and not for serialization, so if you were to use this with transformCSV, your provided transformation conduit would be responsible for handling errors because they would be otherwise stripped from output. The more I think about this the less weird it sounds.

bitemyapp commented 7 years ago

@MichaelXavier I'd make it a one-time big break and uncouple to/from, if it were up to me.

Have the CSV class sit atop To/From classes perhaps for temporary compat, but I'd uncouple them regardless. Rip the bandaid off so that we don't have work left on the table that we already knew needed done.

MichaelXavier commented 6 years ago

@ozataman I'm inclined to agree with @bitemyapp. Shall I take a pass at splitting the class into to/from and then implement option 3 which doesn't swallow errors but wraps an Either? We could keep the typeclass which combines them (CSV). I think we could even make it so it just provides a default instance for when you already have ToCSV and a default instance for FromCSV, so existing instances would stay compatible. We could drop that later but there would be no rush.

I think I'd have to get into the code to see if this would cause API breakage, but honestly this library is stable enough that we could cut a new version with the improved types and people on the old version would not really have any pressure to upgrade existing projects.

ozataman commented 6 years ago

Sounds good to me at a high level - happy to discuss specifics once you get into it.

One helpful thing would be to somehow keep combinators that wrap over Either to keep current behavior so it’s very easy to adapt to new API at use sites for old code.

On Tue, Sep 26, 2017 at 10:44 AM Michael Xavier notifications@github.com wrote:

@ozataman https://github.com/ozataman I'm inclined to agree with @bitemyapp https://github.com/bitemyapp. Shall I take a pass at splitting the class into to/from and then implement option 3 which doesn't swallow errors but wraps an Either? We could keep the typeclass which combines them (CSV). I think we could even make it so it just provides a default instance for when you already have ToCSV and a default instance for FromCSV, so existing instances would stay compatible. We could drop that later but there would be no rush.

I think I'd have to get into the code to see if this would cause API breakage, but honestly this library is stable enough that we could cut a new version with the improved types and people on the old version would not really have any pressure to upgrade existing projects.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/ozataman/csv-conduit/issues/24#issuecomment-332278465, or mute the thread https://github.com/notifications/unsubscribe-auth/AAA1_D7Xq6Hb9ftwKorFuIMRwthrDZfLks5smTfpgaJpZM4MbgeS .

--

Ozgun Ataman

CEO Soostone Inc. 25 Broadway, 9th Floor New York, NY, 10004

+1 917 725 0849 (Office) +1 734 262 0676 (Mobile) +1 212 320 0251 (Fax) ozgun.ataman@soostone.com