marklister / product-collections

A very simple, strongly typed, scala framework for tabular data. A collection of tuples. A strongly typed scala csv reader and writer. A lightweight idiomatic dataframe / datatable alternative.
BSD 2-Clause "Simplified" License

Provide support for parsing from scala.io.BufferedSource #11

Closed metasim closed 10 years ago

metasim commented 10 years ago

Recommend providing a sibling method to CsvParser.parseFile (e.g. CsvParser.parse) that accepts a scala.io.BufferedSource (or other general input stream) instead of a file name.

marklister commented 10 years ago

Yes, the current parseFile method is a terrible idea. A scala.io.BufferedSource is really converting a file into an Iterable (as far as I can see). Perhaps we should go for something more low-level?

I always found java.io terribly complicated: so many classes providing tiny bits of functionality.

Any suggestions for a simple yet versatile way to plug into the java.io and scala.io ecosystems?

marklister commented 10 years ago

What about a java.io.Reader, given that I'm wrapping opencsv?

metasim commented 10 years ago

Yes, important points. If it were me, I'd take whatever approach minimizes the amount of parsing mess your library has to deal with while maximizing whatever exposed flexibility opencsv has, mediated by how much hard coupling to opencsv you end up with.

I'm looking at the JavaDoc for CSVReader(...), and since it looks like it only supports Reader as an input, I'd suggest exposing just that in addition to the String file name, and let the users figure out the impedance matching.
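To illustrate the "impedance matching" a caller would do, here is a minimal sketch of turning common inputs into a java.io.Reader using only the standard library. The object name ReaderAdapters and the naiveRows helper are hypothetical and are not part of product-collections or opencsv; naiveRows is only a stand-in for whatever Reader-based parsing method the library ends up exposing.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader, Reader, StringReader}
import java.nio.charset.StandardCharsets
import scala.io.BufferedSource

// Hypothetical helper object, for illustration only.
object ReaderAdapters {
  // In-memory text, handy for tests.
  def fromString(text: String): Reader = new StringReader(text)

  // Any InputStream (file, socket, classpath resource); the charset is explicit.
  def fromStream(in: InputStream): Reader =
    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))

  // A scala.io.BufferedSource wraps an InputStream and can hand back a reader over it.
  def fromSource(src: BufferedSource): Reader = src.bufferedReader()

  // Stand-in for a Reader-based parser: splits each line on commas.
  // It ignores quoting and escapes, which is exactly what opencsv handles.
  def naiveRows(reader: Reader): List[List[String]] = {
    val br = new BufferedReader(reader)
    try Iterator.continually(br.readLine()).takeWhile(_ != null).map(_.split(",").toList).toList
    finally br.close()
  }
}
```

With an API that accepts only a Reader, all three adapters feed the same entry point, so the library itself never needs to know whether the data came from a string, a stream, or a scala.io source.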

The brilliance of your library is the tight focus on being really good at one or two things, which, in my opinion, includes not just the powerful type-safe column- and row-oriented operations, but the extensible use of implicit string converters. Providing out-of-the-box support for going from text (and perhaps back to text) is important for adoption, and a key feature I need, but I'd stay away from the lower level tokenization process if you can (since that problem has been solved in many other libraries).

I played with several of the other Scala wrappers around CSV parsing, including mighty-csv, scala-csv, and Saddle. Despite my focus on CSV parsing (i.e. how the data is stored), I really don't /want/ to care about the parsing. It's the final data representation that comes out of the parser that I care about, and how much work is required to get there. In product-collections you've hit the ultimate sweet spot from an idiomatic Scala point of view.

If I can have parsing from something other than providing a file name, it's the 100% perfect solution for me.

Just my 2¢.

marklister commented 10 years ago

Thanks for the feedback. I pushed some changes to master yesterday that allow parsing a java.io.Reader. My (Zimbabwean) internet connection died while I was trying to update the documentation, but there's a new test, IOSpec.scala, that demos the new functionality.

Today I'll probably add support for String -> Option[T] converters, where T is one of the common types (Int, Double, String, etc.).
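For readers unfamiliar with the idea, a String -> Option[T] converter can be derived from an existing String -> T converter with a type class. The sketch below is only an assumption about how that might look: the FromString trait, its instances, and the parse helper are hypothetical names, not the library's actual converter API, and the choice to map blank or unparseable fields to None is likewise an assumption.

```scala
import scala.util.Try

// Hypothetical type class; the name is illustrative, not the library's.
trait FromString[T] { def convert(s: String): T }

object FromString {
  implicit val intFromString: FromString[Int] =
    new FromString[Int] { def convert(s: String): Int = s.trim.toInt }

  implicit val doubleFromString: FromString[Double] =
    new FromString[Double] { def convert(s: String): Double = s.trim.toDouble }

  implicit val stringFromString: FromString[String] =
    new FromString[String] { def convert(s: String): String = s }

  // Lifts any converter to Option: blank or unparseable fields become None.
  implicit def optionFromString[T](implicit inner: FromString[T]): FromString[Option[T]] =
    new FromString[Option[T]] {
      def convert(s: String): Option[T] =
        if (s.trim.isEmpty) None else Try(inner.convert(s)).toOption
    }

  // Convenience entry point: the target type selects the converter.
  def parse[T](s: String)(implicit c: FromString[T]): T = c.convert(s)
}
```

Under these assumptions, FromString.parse[Option[Int]]("42") yields Some(42), while an empty field or a value such as "n/a" yields None instead of throwing.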

I'm glad you found product-collections useful. I use it often. I tried Saddle, but something irritated me about its parsing solution and its respect for underlying types.

marklister commented 10 years ago

Pushed the updated doc and bumped the version to 0.0.4.3-SNAPSHOT.

metasim commented 10 years ago

Fantastic! Thanks so much!