jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0

Eclipse support... #274

Closed: hallvard closed this issue 4 years ago

hallvard commented 6 years ago

I'd like to discuss what kind of Eclipse support would be interesting. I'm thinking both of easing the use of Tablesaw in Eclipse plugins for developers, and of viewing, analysing and editing data for users (data analysts), e.g.:

My personal interest is using tablesaw in a kind of workbench for learning analytics for my Java course at NTNU (google "wiki tdt4100"), but I would guess most of what I will be doing could have general interest, if made generic enough. I know how to do all of the above, but would like to discuss what will provide the most benefit.

lwhite1 commented 6 years ago

This sounds cool. I would be happy to discuss, but I'm an IDEA user so I won't get all the details. @benmccann uses Eclipse, though, and might have useful input.

In any case, let me do some homework on the tools you mention so I can understand them better, and then we can chat.

benmccann commented 6 years ago

@hallvard another project you may be interested in is beakerx, which has good tablesaw support

lwhite1 commented 6 years ago

@hallvard @benmccann makes a good point about beakerx, given the ubiquity of Jupyter. When I first looked at the project it was an alternative to Jupyter, rather than building on it. Now that it's Jupyter-compatible, it's a lot more sustainable.

That said, there is a recent thread on Hacker News about the pros and cons of Jupyter versus the Mathematica notebook, mainly around the limitations of working in a browser. https://news.ycombinator.com/item?id=16840692

I have always been underwhelmed with what I've seen in the browser-based notebooks. As someone wrote in the HN thread, IPython was developed as a better REPL, which is great, but not a very high bar. For what it's worth, I do my analysis work in IntelliJ IDEA, rather than in a notebook. If I wanted to share the results widely, I might copy the finished analysis to Jupyter/beakerx for sharing.

A good environment would need three critical elements: a table editor, a workspace with good support for code completion and syntax highlighting, and a mechanism for displaying plots.

With regard to plots: Tablesaw is almost certainly going to move soon from the current native Java plots to JavaScript-based plots, probably built on plot.ly. So the visualization component would need to be the desktop browser or an embedded browser.

With regard to the workspace: I've always been envious of the expressiveness of array languages like APL, but I would not create a new scripting language, although that sounds like fun. I would wrap Tablesaw in Kotlin. I'm told the Kotlin support in Eclipse is not great, so that could be an issue, but Kotlin is easy for Java devs to learn, has good support for functional programming, and supports operator overloading, so we could do columnwise addition as c1 + c2, etc. So I think a Kotlin wrapper could close the gap with languages like R or Julia that are specifically designed for analysis.
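To make the operator-overloading point concrete, here is a minimal sketch of how c1 + c2 could work in Kotlin. NumColumn is a hypothetical stand-in written for this example; it is not Tablesaw's column class, and a real wrapper would delegate to the library instead.

```kotlin
// Hypothetical stand-in for a numeric column; NOT Tablesaw's DoubleColumn.
class NumColumn(val name: String, private val values: DoubleArray) {
    val size: Int get() = values.size

    // Declaring `operator fun plus` is what lets callers write c1 + c2.
    operator fun plus(other: NumColumn): NumColumn {
        require(size == other.size) { "Columns must have the same length" }
        // Element-wise sum into a new column; neither input is modified.
        return NumColumn("$name + ${other.name}", DoubleArray(size) { values[it] + other.values[it] })
    }

    fun toList(): List<Double> = values.toList()
}
```

With this in place, `val total = price + tax` reads like the equivalent R or Julia expression while still compiling to an ordinary method call.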

If I were building something from scratch, I might consider modeling the workspace as a WYSIWYG Markdown editor like Typora, where the code in the code blocks was executable. This is basically the notebook idea, supporting something like Knuth's vision of Literate programming.

The main drawbacks to Kotlin are that it doesn't support array index overriding, so you can't say columnX[4], to get the 4th item, and more generally that it inherits the weaknesses of java's generics. That's the price of excellent Java interoperability though.

Regarding the table editor, I don't really have any real suggestions. There is a Java FX Table implementation in Tablesaw's plot module, and Plot.ly does HTML table output. I'm not sure if either helps you move forward.

hallvard commented 6 years ago

My goal is not to make something that will replace R, but allow you to do a bit more within Eclipse before exporting data to a generic format like CSV and continuing in R (if needed). Over time, the data analysis power of tablesaw would improve and more could be done within Eclipse.

The current table editor can easily be embedded in Eclipse, provided that tablesaw itself is embeddable. There is also a CSV table editor that may be used as a starting point (haven't checked the license). So the first step would be to make tablesaw OSGi-friendly, which means solving some dependency and/or packaging issues.

The next question is what operations (filtering, transformations etc.) such an editor should allow, including spawning new table editors, so you get a flexible environment.

The third issue is making it easier to write tablesaw code. Eclipse has its own Kotlin-like language called Xtend, which supports operator overloading and functional programming. Some library code would make that pretty comfortable.

Plotting in a browser would work, as Eclipse includes embedded browser support.

thomashaselwanter commented 6 years ago

IDEA user here. I like the way IDEA has native support to view a pandas dataframe in Python as a table structure during a debug session. Tablesaw is currently treated like any other object. It sure would be nice to get the same support for Tablesaw in IDEA. Not sure if the best way would be to lobby JetBrains to do it or via a plugin. I guess I'd first try and open a ticket on the IDEA ticket tracker. From experience - it might take a year or longer, but I think there is a good chance JetBrains could support Tablesaw natively as they do for pandas.

hallvard commented 6 years ago

Eclipse allows plugins to support a different and more logical view of certain objects in the debugger's Variables view, e.g. there is such support for ArrayList. But this is outside the scope of what I want to do...

thomashaselwanter commented 6 years ago

I'll move my IDEA support wishlist to a different thread as this one is about Eclipse.

cgrinds commented 6 years ago

> The main drawbacks to Kotlin are that it doesn't support array index overriding, so you can't say columnX[4], to get the 4th item

Maybe you mean something else, but Kotlin supports indexed access operators. I use that with a wrapper around Tablesaw so I can do stuff like this:

table["name"] = "Foo"    // this does table.column(key).append(value)
table["elapsed"] = 314
val hasBar = table["name"].contains("bar")

lwhite1 commented 6 years ago

@cgrinds oh, that's great! I stand corrected.

lwhite1 commented 6 years ago

@cgrinds I'm not sure how much of a Kotlin wrapper you've created, but if you're willing to share the code somehow, I'd be interested in seeing it. We had made a start at a Kotlin wrapper a while back, but removed it for lack of time.

cgrinds commented 6 years ago

At the moment, the wrapper is very ad hoc and specific to my use. I incorporated Tablesaw into my app about a week ago, so I've not had much time with it. That said, it might be useful to discuss areas where I've felt the need to add extension methods since that may point to

My application parses millions of custom log files to find interesting events, then aggregates and displays them as ASCII tables. Tablesaw is used for the aggregation and display. I suspect this may be a different use of Tablesaw than others, but maybe not...

I'm using HEAD and the new APIs.

It wasn't obvious how to iterate a table's rows. I settled on the following, which uses Kotlin's forEach and the fact that Tables implement IntIterable. This works out of the box with no changes since Kotlin rewrites the [r, c] calls to use Table's get(int r, int c) method. This feels a bit hacky and maybe off the beaten path. Not sure many people want to iterate rows manually. The get signature also converts everything to a String which isn't always what you want and forces you to convert. Maybe the more idiomatic way would be to grab each column and iterate over them individually? This might be a good FAQ candidate.

val table = rTable.sortOn("start").select(
    rTable.stringColumn("name"), rTable.stringColumn("resourceType"))
    .rejectDuplicateRows()

table.forEach {
    val name = table[it, 0]
    val resourceType = table[it, 1]
}
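For readers unfamiliar with the mechanism, the following sketch shows why table[it, 0] and table["name"] work. Kotlin turns index syntax into calls to a function named get, and it does the same for a Java method with that name and a matching signature. MiniTable is a hypothetical stand-in defined only for this example, not Tablesaw's Table.

```kotlin
// Hypothetical stand-in for a table of strings; NOT Tablesaw's Table class.
class MiniTable(private val columnNames: List<String>) : Iterable<Int> {
    private val rows = mutableListOf<List<String>>()

    fun addRow(vararg cells: String) {
        require(cells.size == columnNames.size) { "Expected ${columnNames.size} cells" }
        rows.add(cells.toList())
    }

    // table[r, c] compiles to table.get(r, c)
    operator fun get(row: Int, col: Int): String = rows[row][col]

    // table["name"] compiles to table.get("name") and returns the whole column
    operator fun get(column: String): List<String> {
        val c = columnNames.indexOf(column)
        require(c >= 0) { "No column named $column" }
        return rows.map { it[c] }
    }

    // Iterating the table yields row indexes, so forEach { table[it, 0] } works
    override fun iterator(): Iterator<Int> = rows.indices.iterator()
}
```

Because iteration only hands out row numbers, every cell access still goes through get, which is where the String conversion mentioned above comes from.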

I created some helpers to make adding data to tables easier. Instead of this:

val urlCol = StringColumn.create("url")
val elapsedCol = DoubleColumn.create("elapsed")
val methodCol = StringColumn.create("method")
val otable = Table.create("", urlCol, elapsedCol, methodCol)
for (event in events) {
    urlCol.append(event.url)
    elapsedCol.append(event.elapsed.toDouble())
    methodCol.append(event.method)
}

I do this:

val table = makeTable("s:url", "d:elapsed", "s:method")
for (event in events) {
    table.append("url", event.url)
    table.append("elapsed", event.elapsed)
    table.append("method", event.method)
}

I'm not a huge fan of the magic values in makeTable above ('s' for StringColumn and 'd' for DoubleColumn), but it's nice to skip creating the columns individually. In many cases, I never need the columns because so many of Table's APIs take the column name, but all told this isn't a big difference.
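As an illustration only, the "s:url" style of spec could be parsed along the lines below, with a typed alternative that avoids the single-letter codes. ColKind, ColSpec, parseSpec and specsOf are all hypothetical names invented for this sketch; the actual makeTable helper is not shown in the thread.

```kotlin
// Hypothetical column kinds; a real wrapper would map these to Tablesaw column types.
enum class ColKind(val code: String) { STRING("s"), DOUBLE("d") }

data class ColSpec(val kind: ColKind, val name: String)

// Parses "s:url" into ColSpec(STRING, "url") and fails loudly on unknown codes.
fun parseSpec(spec: String): ColSpec {
    val parts = spec.split(":", limit = 2)
    require(parts.size == 2 && parts[1].isNotEmpty()) { "Expected <code>:<name>, got '$spec'" }
    val kind = ColKind.values().firstOrNull { it.code == parts[0] }
        ?: throw IllegalArgumentException("Unknown column code '${parts[0]}' in '$spec'")
    return ColSpec(kind, parts[1])
}

// Typed alternative without magic strings: specsOf("url" to ColKind.STRING, ...)
fun specsOf(vararg columns: Pair<String, ColKind>): List<ColSpec> =
    columns.map { (name, kind) -> ColSpec(kind, name) }
```

The typed form is more verbose, but a typo in a column kind becomes a compile error rather than a runtime failure.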

This is a simple Kotlin extension that makes it easier to rename the last column; a common action when summarizing.

Assume you have this:

val countByUrlMethod2 = table.summarize("url", count).by("url", "method")

Out of the box:

countByUrlMethod2.column(countByUrlMethod2.columnCount() - 1).setName("Calls")

Extension:

countByUrlMethod2.lastColumn().setName("Calls")

A better extension that allows regex and chaining. The chaining is nice because you can skip intermediate assignments:

val countByUrlMethod3 = table.summarize("url", count).by("url", "method")
    .renameCol("count", "Calls")
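The two extensions could be sketched roughly as follows. Col and Tbl are hypothetical stand-ins so the example is self-contained, and the matching rule in renameCol (case-insensitive regex search on the column name) is an assumption, since the thread does not show the implementation.

```kotlin
// Hypothetical stand-ins; Tablesaw's own Table and Column classes are not used here.
class Col(var name: String)
class Tbl(val columns: List<Col>)

// Extension: the last column, e.g. the aggregate produced by a summary.
fun Tbl.lastColumn(): Col = columns.last()

// Extension: rename every column whose name contains a match for the pattern,
// then return the table itself so calls can be chained.
fun Tbl.renameCol(pattern: String, newName: String): Tbl {
    val regex = Regex(pattern, RegexOption.IGNORE_CASE)
    columns.filter { regex.containsMatchIn(it.name) }.forEach { it.name = newName }
    return this
}
```

Returning the receiver is what makes the chained form possible; an extension that returned Unit would force the intermediate assignment again.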

I often have tables where I want to sum several columns and append a total row at the bottom of the table.

Assume you have this:

  url    |  method  |  calls  |   elapsed   |
---------------------------------------------
 /a/b/c  |  Delete  |   12.0  |   206443.0  |
 /a/b/c  |     Get  |   10.0  |   159215.0  |
 /a/b/c  |   Patch  |    9.0  |   144313.0  |

I created a Kotlin extension named addTotalRow that turns the above table into the table below.

urlsByCount.addTotalRow("url", "calls", "elapsed") means: append a total row with the word Total in the url column, totalling the calls and elapsed columns. This results in:

  url    |  method  |  calls  |   elapsed   |
---------------------------------------------
 /a/b/c  |  Delete  |   12.0  |   206443.0  |
 /a/b/c  |     Get  |   10.0  |   159215.0  |
 /a/b/c  |   Patch  |    9.0  |   144313.0  |
  Total  |          |   31.0  |   509971.0  |
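A minimal sketch of the idea, assuming rows are modelled as simple maps rather than as a Tablesaw table. The Record alias and this version of addTotalRow are hypothetical and written only to show the logic.

```kotlin
// Hypothetical row model: each row maps a column name to a cell value.
typealias Record = Map<String, Any>

// Appends a row with "Total" in labelColumn and the sum of each listed numeric
// column. Columns that are neither the label nor summed are left blank.
fun List<Record>.addTotalRow(labelColumn: String, vararg sumColumns: String): List<Record> {
    if (isEmpty()) return this
    val total: Record = first().keys.associateWith { key ->
        when (key) {
            labelColumn -> "Total"
            in sumColumns -> map { (it[key] as Number).toDouble() }.sum()
            else -> ""
        }
    }
    // Returns a new list; the original rows are not modified.
    return this + total
}
```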

The other extension I created is toAsciiTable so I can do things like: urlsByCount.toAsciiTable(Align.Left, Align.Left) to have more control over number grouping, turning doubles into longs, and alignment (using your new API).

| Url    | Method | Calls |   Elapsed |
|:-------|:-------|------:|----------:|
| /a/b/c | Delete |    12 |   206,443 |
| /a/b/c | Get    |    10 |   159,215 |
| /a/b/c | Patch  |     9 |   144,313 |
| Total  |        |    31 |   509,971 |
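The formatting side can be sketched like this, assuming plain lists as input. This toAsciiTable and its Align enum are hypothetical. Unlike the call shown above, the sketch takes one alignment per column, and it omits the header separator row for brevity.

```kotlin
import java.util.Locale

enum class Align { Left, Right }

// Formats rows as a pipe-delimited table. Whole-number doubles are printed as
// grouped integers (206443.0 becomes 206,443); other values use toString().
fun toAsciiTable(headers: List<String>, rows: List<List<Any>>, aligns: List<Align>): String {
    fun cell(v: Any): String =
        if (v is Double && v % 1.0 == 0.0) "%,d".format(Locale.US, v.toLong()) else v.toString()

    val cells = rows.map { row -> row.map { cell(it) } }
    // Each column is as wide as its longest header or cell.
    val widths = headers.indices.map { c ->
        maxOf(headers[c].length, cells.maxOfOrNull { it[c].length } ?: 0)
    }

    fun pad(s: String, c: Int): String =
        if (aligns[c] == Align.Left) s.padEnd(widths[c]) else s.padStart(widths[c])

    fun line(values: List<String>): String =
        values.withIndex().joinToString(" | ", "| ", " |") { (c, s) -> pad(s, c) }

    return (listOf(line(headers)) + cells.map { line(it) }).joinToString("\n")
}
```

Fixing the locale keeps the grouping separator stable across machines, which matters when the output is diffed or pasted into reports.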

Again I'm not suggesting you make changes based on this feedback, but thought you might be interested in how your API was being used.

lwhite1 commented 6 years ago

@cgrinds Thank you very much for the detailed feedback. Some thoughts:

> It wasn't obvious how to iterate a table's rows. I settled on the following, which uses Kotlin's forEach and the fact that Tables implement IntIterable.

Right now, you're doing it the best way possible. I have a local branch where I'm looking at making table iterate tech.tablesaw.api.Row instead of int. What's in master is flawed, but I think this may be better than int iteration. Row is kind of a single-row slice, so it doesn't convert anything or otherwise create garbage unless it's really needed, but still makes rows a bit more 'real'. This will take a few weeks, I think, as I'm still experimenting. Row does support getting values by type, which as you note is missing in table currently:

    String s = row.getString("colX");
    double d = row.getDouble("colY"); 

etc.
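One way to picture a row as a "single-row slice" is a cursor that stores only a row index and reads typed values directly from the column storage. The Kotlin sketch below is an illustration under that assumption. ColumnarTable and RowCursor are hypothetical and do not describe the actual tech.tablesaw.api.Row implementation.

```kotlin
// Hypothetical column-oriented table: each column is stored as its own typed array.
class ColumnarTable(
    val strings: Map<String, Array<String>>,
    val doubles: Map<String, DoubleArray>,
    val rowCount: Int
) : Iterable<RowCursor> {
    // One cursor is created per iteration and repositioned for every row,
    // so iterating does not allocate an object per row.
    override fun iterator(): Iterator<RowCursor> = object : Iterator<RowCursor> {
        private val cursor = RowCursor(this@ColumnarTable)
        override fun hasNext() = cursor.index + 1 < rowCount
        override fun next(): RowCursor {
            if (!hasNext()) throw NoSuchElementException()
            cursor.index++
            return cursor
        }
    }
}

// The cursor holds only a row index; typed getters read from the backing columns.
class RowCursor(private val table: ColumnarTable) {
    var index = -1
    fun getString(column: String): String = table.strings.getValue(column)[index]
    fun getDouble(column: String): Double = table.doubles.getValue(column)[index]
}
```

The trade-off of reusing one cursor is that callers must not keep a reference to a row past the current iteration step, because it will point at a different row afterwards.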

Another use-case for the row object is for allowing comparator based sorts on a table, which is kinda broken in master.

The others I will think about. Better control over column alignment is probably broadly useful, so maybe should be an enhancement.

benmccann commented 6 years ago

The other thing I'd really like to see for iteration is to make stream() available: https://github.com/jtablesaw/tablesaw/issues/257

hallvard commented 5 years ago

I've done some work on the Eclipse integration, and here's what I have so far:

lwhite1 commented 5 years ago

It sounds really cool, but I wish it were in Idea :)

benmccann commented 5 years ago

Haha. I'm still an Eclipse fan, but even then I would have guessed a web app to be the most natural way to build anything UI-related these days, so that you can share with others more easily.

But sounds cool. I think it'd be really neat to demo in a video at some point