twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

TypeDescriptor for Option[String] does not make sense. #1375

Closed johnynek closed 8 years ago

johnynek commented 9 years ago

We should abort when we hit Option[String]. When writing an Option[String] to a text file, a None is indistinguishable from "" (empty string).

Right now, I guess if it is empty, we are returning None, so if someone writes Some(""), then we would read None, which is obviously bad news.

If I am right, this means our tests are weak, because we should be testing round-trips of this kind.

johnynek commented 9 years ago

I think Option[T] only makes sense for primitives or case classes that include at least one primitive.

ianoc commented 9 years ago

At one point we only allowed I thought Option[primitive] to try side step these issues. The decoding stuff can be a bit more of a pain. Is it defined if part of an Option[CaseClass] exists? i guess its None, or should the others be null's?

isnotinvain commented 9 years ago

I'm not sure if we should abort on Option[String] -- it's true that TSV can't differentiate between Some("") and None -- but that's just what you get w/ TSVs. I think this is expected / the right behavior right?

If we were going to do anything defensive, I'd say it'd be to disallow Some("") (throw in the write path)

johnynek commented 9 years ago

I don't like throwing exceptions at all, but there is a good point here: what is this macro doing? It is called TypeDescriptor but the default macro is also flattening and what I'm saying here is even more than flattening, but flattening into strings.

We could use a subclass of this with the TypedText sources, something like TextTypeDescriptor (something similar to what was done with DBTypeDescriptor). This subclass could have the more constrained semantics?

Might be good to settle this issue before we publish a public version of the code with some possibly ambiguous behavior.

What solutions sound best now?

sid-kap commented 9 years ago

I think TypedText.tsv shouldn't allow Option[String] because then it can't obey the round-trip law. Maybe we should have

[Slightly off topic:] We might also throw in a TSV TypedSink that allows any type and uses the toString method to serialize things that aren't primitives, options, or case-classes. This could be useful for debugging.

If we have something like this, the user will never be surprised by what happens.

Gabriella439 commented 9 years ago

I'm in favor of not allowing Option and if the user wants to write out an Option[String] to TSV then they have to explicitly convert the Option[String] to a String by doing something like _.getOrElse(""). This forces the user to intentionally opt-in to the non-round-trip behavior so they are aware of the risks and they take responsibility for the conversion. Doing this lossy conversion without the awareness of the user will just incur a support burden from us explaining to users why it doesn't round-trip.

I know that TSV is not really a production data format, but the extra code required for the user to do an explicit conversion from Option[String] to String is not that high, so why take any risks?

ianoc commented 9 years ago

Given its not a production format and the end users this will target are for either debugging (in which case best effort write but write as much as you can seems the ideal), or for data scientists importing into another system?

I'd be infavor of asking our data scientists who use scalding a lot for input here as the users

nathantchan commented 9 years ago

I agree with Gabriel. It makes sense to force the user to decide how to handle the None case explicitly when they write their output to TSV. However, once they load that TSV again, it makes more sense to me to read in the empty case as Some("") rather than None. It seems more intuitive to think of a "null string" as an empty one.

csaid commented 9 years ago

I think I disagree with Nathan and Gabriel on this. Here's one of the top things on my Scalding wishlist.

val myPipe = {
    TypedPipe.from(List(
         (4, None,      Some(7)),
         (3, Some("a"), None)
         )
    )
}

myPipe.save(TypedOsv("myFile.osv"))

For ad hoc analysis, I would strongly prefer to get output like this

4,,7
3,a,,

rather than this:

4,None,Some(7)
3,Some(a),None

I'd be willing to sacrifice round-trip ability on Strings if I can reduce the amount of .getOrElse lines I have to add to all my ad hoc jobs.

csaid commented 9 years ago

@sid-kap , would the last of your four classes be able to do what I described in my previous comment?

sid-kap commented 9 years ago

Yes, that's what I had in mind.

Gabriella439 commented 9 years ago

What about this translation scheme:

Some("") => ...,"",...
None     => ...,,...

So in Chris's example, you would get:

4,,7
3,a,

... but if they were Some("") instead of None you would instead get:

4,"",7
3,a,""
csaid commented 9 years ago

I'd be fine with @sid-kap 's approach, which would have a strict sink for people who want to be explicit and a looser sink for people who just want to do their ad hoc sampling quickly, and don't ever need to fulfill a round trip. I'd also be totally fine with Gabriel's solution for Strings, assuming that (Some(5), None, Some(6)) would get converted to 5,,6. With Gabriel's approach, there only needs to be one type of sink.

sid-kap commented 9 years ago

Slightly off topic:

That reminds me, I noticed that our TypedText classes don't support quoted values. A few weeks ago I helped a fellow intern write a subclass of TypedTextDelimited that allowed quoted values, so that he could read in numbers like "1,000,000" from a CSV.

@csaid Do you think we should support quoted values? It's part of the CSV spec, so I think we should allow it. (https://tools.ietf.org/html/rfc4180)

sid-kap commented 9 years ago

Supporting quotes might conflict with Gabriel's suggestion.

csaid commented 9 years ago

I'd be happy with Gabriel's solution even if it meant losing the ability to work with quoted values. I never deal with them.

ianoc commented 9 years ago

Quotes are a minefield, i believe though our TypedText stuff does support them in so much as the underlying cascading decoders do. quote handling always gets iffy in csv type data. Usually best to avoid...

johnynek commented 9 years ago

@csaid keep in mind the current TypedText.osv does not write "Some(a)" or "None" it does the "right" thing.

What I'm concerned with here is this: who is reading the data if we go with a non-round-trip format? Aren't we just sowing confusing and giving people more rope?

If you want debug, why not do: .map(_.toString).write(TypedText.tsv("output"))? Won't that work?

My preference is:

1) Simplicity: there should be one way to do this. 2) Correctness: if I write data, I should be able to read it and have it round-trip.

So, I don't like the multiple options for sinks/sources and a different behavior for sources which are also sinks.

As for quoting, I think this could be an interesting way to go, but as Ian says, we'd probably have to abandon using cascading's parsing and put our own logic in.

Practical question: do people have a compelling example of wanting to use Option[String]? When they do, do they care about distinguishing None from Some("")? My guess is that Option[String] is only coming up because they got from a scrooge thrift that was all optional, and they didn't do .getOrElse("") on it.

Lastly: note we can support Option[(String, Int)] because if the Int is present, the string is. Similarly, we could add an extra column when we see Option[String] to indicate if the string is present (like add a 1 or 0 character). The thing I don't like about this approach is that it limits portability which is the main reason to use a text format like this.

csaid commented 9 years ago

[I'm on vacation right now, so I won't be able to respond quickly to everything in this thread.] Oscar, I didn't realize that the current TypedText.osv does the "right" thing (i.e. not writing "Some(a)" or "None"). That's great, and it removes an extra step from my ad hoc work. It sounds like what's on the table now is adding that extra step back in, at least for Strings.

Those of us consuming data in a non-round-trip way are analyzing CSV files in R or Python and are willing to tolerate a little ambiguity around rare edge cases in exchange for ease of use.

If we want to avoid multiple sinks and quoting, then my vote is to stick with the current behavior.

Relatedly, I'd find it very useful if we could get to a point where something like this (which is very common after a .leftJoin)

val myPipe = {
    TypedPipe.from(List(
         (1, Some(Some("foo"), None)),
         (2, Some(None, Some(12))),
         (3, None)
         )
    )
}

could get written out as

1,foo,
2,,12
3,,

without having to manually flatten things with a .map and a lot of .getOrElse statements. You all know much better than me, but I suspect that allowing Option[String] makes it easier to get to this point, right?

johnynek commented 9 years ago

@csaid Is that last line (3, None) or (3, Some(None, None))?

If we support this, then you get ambiguous output, which is exactly what we don't want. Currently you can't do two layers of nesting Option for this reason.

I don't think dealing with the nesting that happen in joins here is the best solution because often we won't write out after those operations. Have you seen MultiJoin: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/MultiJoin.scala#L403 this handles some of this issue for you.

joshualande commented 9 years ago

This is a very interesting discussion.

From an ease of use standpoint, I would prefer both the empty string and None to write out to the same thing in csv, and round trip it back into None.

Practical question: do people have a compelling example of wanting to use Option[String]?

Frequently, my ad hoc code has to deal with optional things (like a field in an upstream table that isn't set, and I want to save that state to an intermediate table).

When they do, do they care about distinguishing None from Some("")

For me, the empty vs NA seems totally ambiguous and not well supported by cvs. I wouldn't trust my downstream analysis to do the right thing. So if I every cared about empty strings, I would probably just map them to something else (like "-") to avoid confusion.

isnotinvain commented 9 years ago

I agree with @joshualande -- lets not forget that the whole point is to write some data to a TSV, users of TSV need to be aware that None and "" mean the same thing in TSV, and probably expect and rely on that, this is true in many languages (java libraries usually encode null as "" in TSV, python libraries do the same, and so on).

isnotinvain commented 9 years ago

I'm also not in favor of making a "clever" or more fully featured TSV format (ones with quotes, escaping, and all that). It just adds complexity, and at the end the format is still a bad format.

What about strings that contain the delimiter (tab) character? What about strings with newlines in them? I think users of this format understand that it's a very simple format and doesn't handle things like that.

csaid commented 9 years ago

Agree with Josh/Alex on this issue. Oscar, sorry, my example didn't show what I'd get after a leftJoin. Here's a complete example, which I encounter very commonly.

val myLeftPipe = {
    TypedPipe.from(List(
         (1, "a"),
         (2, "b")
         )
    )
}

val myRightPipe = {
    TypedPipe.from(List(
         (1, Some("x"), Some("y")),
         (1, None, Some("z"))
         )
    )
}

val g1 = myLeftPipe.groupBy(_._1)
val g2 = myRightPipe.groupBy(_._1)

val multiJoined = MultiJoin.left(g1, g2).toTypedPipe

This gives me a TypedPipe[(Int, ((Int, String), Option[(Int, Option[String], Some[String])]))]. I'm looking for a simple way to get from that to something flat like this: 1, a, x, y 1, a, , z 2, b, , , I might be missing something, but I'm not sure what is ambiguous about this. I suppose that a consumer of this CSV might not know whether the last record was caused by a missing key in myRightPipe or by two missing values in myRightPipe. But that's an ambiguity with all SQL style left joins, afaik.

Sorry if this is a bit off-topic.

isnotinvain commented 9 years ago

The ambiguity is easy to explain, if you write: ("foo", None, 7) to a CSV file, you get foo,,7 if you write: ("foo", Some(""), 7) to a CSV file, you also get foo,,7

so when we read foo,,7 which one should we give you? We don't know how we got here, so, if you write and then read, you won't get exactly the same results (all your Some("") will have been turned into None)

csaid commented 9 years ago

Oh yeah, I understand that ambiguity, and I think several of us are saying it's ok to tolerate it. For some reason, I thought Oscar was talking about a different ambiguity caused by flattening the results of a .leftJoin.

johnynek commented 9 years ago

So, it sounds like everyone wants to keep the ambiguity.

I hear your point Alex that this thing is already a bit crappy: in that we are not 100% sure (and doubt) that it handles strings that contain the delimiter.

To me, that just makes me want to fix that issue too. I really hate mine-fields, and I hate errors at runtime that could have been turned into compile-time errors.

We could implement another function (probably with a macro), that flattens Option[String] into String inside of tuples. Something like:

trait GetOrElseEmpty[T] {
  type U // result type
  def unbox(t: T): U
}

Seems like it would work:

defined trait GetOrElseEmpty

scala> 

scala> //then we would have macros to create things like:

scala> new GetOrElseEmpty[(Int, Option[String])] {
     |   type U = (Int, String)
     |   def unbox(t: (Int, Option[String])) = (t._1, t._2.getOrElse(""))
     | }
res2: GetOrElseEmpty[(Int, Option[String])]{type U = (Int, String)} = $anon$1@2050d5d

scala> implicit val go: GetOrElseEmpty[(Int, Option[String])] = res2
go: GetOrElseEmpty[(Int, Option[String])] = $anon$1@2050d5d

scala> def unbox[T](t: T)(implicit goe: GetOrElseEmpty[T]): goe.U = goe.unbox(t)
unbox: [T](t: T)(implicit goe: GetOrElseEmpty[T])goe.U

scala> unbox((1, None: Option[String]))
res4: go.U = (1,"")

Then we could keep the code exact, but then have a method like:

myPipe.map(unboxOptionString).write(TypedText.tsv("output"))
sid-kap commented 9 years ago

TypedText doesn't correctly handle strings that contain the delimiter. I think we could fix this by using "\"" as the quote delimiter here: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/source/TypedText.scala#L87

I think the Hadoop text reader/writers should take care of the mines involved with quoting and strings that contain commas or newlines.

johnynek commented 9 years ago

@sid-kap can you write some tests that exercise this? I'm in favor of fixing. Do most TSV, CSV and 1-SV parsers understand the " quoting?

sid-kap commented 9 years ago

The "official" CSV format allows " quoting: see https://tools.ietf.org/html/rfc4180#section-2. This allows us to put commas and newlines within quotes.

However, the official TSV format does not allow tabs in the fields, and therefore (I assume) has no need for quoting. (I'm guessing newlines are also not allowed.) See http://www.iana.org/assignments/media-types/text/tab-separated-values, or https://en.wikipedia.org/wiki/Tab-separated_values.

I can't find any documentation on OSV.

Sure, I can write some tests for this. I think property-based tests for roundtrip-ability would be a good idea.

johnynek commented 9 years ago

So, we could by default throw if the user gives us a string containing the delimiter in the case of tsv or osv, and in the case of csv do quoting.

sid-kap commented 9 years ago

That sounds good.

isnotinvain commented 9 years ago

Personally, I don't like the added complexity. I agree, traps are bad, and that's why we don't encourage use of delimted sources.

Given that this format has traps, I think it's much better to keep it incredibly simple, so that the traps are easily understood, instead of bolting on fixes until we've created a very confusing format. There is not official CSV / TSV format, just lots of formats that separate things with commas and tabs. Some do string escaping -- do they all escape in exactly the same manor?

Why don't we just encourge typed-json for human readable string based formats, and leave this one as simple as possible?

sid-kap commented 9 years ago

There is a MIME types for text/csv which has guidelines on how to escape strings with quotes. If we follow the MIME specification, there shouldn't be any ambiguity.

If we don't want to add support for quotes, the simple solution would be to fail on strings that contains a comma, newline, or double-quote.

johnynek commented 9 years ago

The thing about TSV/CSV is that they are highly portable. The main reason we didn't like them in the past is that there was not a good schema, but this typed code requires you to write a schema and it fails if the data is bad. It is hard to see how this is much different than thrift to me (except in thrift the IDL language is not the language you are writing, which is hardly a feature).

Given that excel, R, python, etc... all consume and produce TSV/CSV with minimal effort, I think it is good to support it as correctly as possible, and indeed in many cases it is a fine choice (assuming you have a clear schema, which we do here).

I don't like complexity for the user, but I don't mind (a certain amount) of complexity for the system implementer: that's the goal. Do heavy lifting once, and then make an API that is easy to use correctly. If we got this right, TSVs would be much safer to use without ambiguity, and I think that would be a win.

isnotinvain commented 9 years ago

Yes, but I think it's worth verifying that R, python, pig, hadoop's i/o formats, and others actually use the spec described there.

Gabriella439 commented 9 years ago

I still prefer to have the user make the conversion from Option[String] to String explicitly instead of doing it magically for them. Even experienced programmers make mistakes, and the worst thing about silent conversions is that they are silent! The thing that is most pernicious about this class of mistake is that users won't even be able to estimate how often they make this mistake until they work for some period of time in an exact and precise setting:

@csaid and @joshualande may have made this mistake already without even realizing that they made it, which is not to imply that they are bad data scientists (I think they are both great!), but rather to imply that everybody makes mistakes when given inexact tools to work with.

To quote Dijkstra:

Instead of regarding the obligation to use formal symbols as a burden, we should regard the convenience of using them as a privilege: thanks to them, school children can learn to do what in earlier days only genius could achieve.

joshualande commented 9 years ago

I think this was discussed above, but is the any chance we can make TypedTsv round-trip safe, but add a second write-only source that writes out the data for the easiest processing in R/python?

sid-kap commented 9 years ago

@isnotinvain I tested reading and writing CSVs in R and the python pandas library, and the quoting style looks consistent with the CSV spec I linked to above.

Pig's CSVLoader docs also supports the same quoting style (according to http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/CSVLoader.html).

Unfortunately, Hadoop's TextDelimited scheme doesn't properly support newlines in the middle of fields, but there is a Hadoop CSV scheme that someone wrote that handles this properly (https://github.com/datascienceinc/cascading.csv). We could add this as a dependency and use this instead of TextDelimited.

ianoc commented 9 years ago

Adding a dependency on another project is something we'd be pretty slow to do, lots of maintainance down the track. If its licensed permissively enough and has short enough source it might make sense to copy it in tho.

ianoc commented 9 years ago

It looks like it has dependencies on apache commons too. Doubt we'd want that in scalding core. I guess it starts becoming a question of do we want these edge cases covered enough for all the work likely here?

johnynek commented 9 years ago

I made support for nesting and removed the ability to read Option[String] here: #1387

joshualande commented 8 years ago

In case it helps anybody else looking through this, you can convert from an Option to a String that can be easily read into R using the mkString function, i.e.

scala> Option(1L).mkString 
res0: String = 1

scala> None.mkString 
res1: String = ""

This is handy if you need to convert the Options before saving to Csv/Tsv, i.e.


case class Data(name:String, value:Option[Long], country:String)
val data = List(
  Data("Fred",Some(1L),"US"),
  Data("Sally", None, "CA")
)

val results = {
  TypedPipe.from(data)
    .map { case Data(name, value, country) => (name, value.mkString, country) }
    .save(TypedCsv("test.csv"))
}

Writes out

$ hadoop fs -cat test.csv/*
Fred,1,US
Sally,,CA