ropensci / textworkshop18


Interoperability #8

Open kbenoit opened 6 years ago

kbenoit commented 6 years ago

This ought to be interpreted very broadly this year, since we have broadened the group to include Python alongside the previous year's mainly R community. It should cover not only the interoperability of packages with one another, but also of toolkits from one language environment (e.g. Python) with those from another (e.g. R).
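As one concrete bridge in that direction, the reticulate package already lets R drive an embedded Python session. A minimal sketch, assuming a Python interpreter is available (the `split()` call just stands in for a real NLP toolkit):

```r
library(reticulate)

# Run Python in-process and pull the result back into R.
# reticulate auto-converts the Python list of str to a character vector.
py_run_string("tokens = 'a shared corpus format'.split()")
py$tokens
```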

Some discussion of the Text Interchange Format would also be useful, as this is something we developed last year but left officially unfinished.
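For anyone who wasn't at last year's workshop, here is a minimal sketch of the draft TIF corpus and tokens objects in R, with column names as I recall them from the draft, so treat it as illustrative rather than normative:

```r
# Draft-TIF-style corpus: a data frame whose first column is a
# character doc_id, alongside a character text column.
corpus <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("Interoperability matters.", "So do shared formats."),
  stringsAsFactors = FALSE
)

# Draft-TIF-style tokens: one row per token, keyed by doc_id.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc2", "doc2", "doc2"),
  token  = c("interoperability", "matters", "so", "do", "shared", "formats"),
  stringsAsFactors = FALSE
)
```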

jwijffels commented 6 years ago

Was something similar to the Text Interchange Format set up for dependency-parsing annotations (e.g. a network or other simple structure suitable for NLP applications) during the 2017 workshop?

kbenoit commented 6 years ago

Not that I recall, although we took some inspiration for the TIF from the CoNLL-U format.
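For reference, CoNLL-U is the ten-column, one-token-per-line format used by the Universal Dependencies project (columns are tab-separated in real files; spaced out here for readability). A tiny example from memory; see the UD documentation for the authoritative column definitions:

```
# text = The dog barks.
1   The     the     DET     _   _   2   det     _   _
2   dog     dog     NOUN    _   _   3   nsubj   _   _
3   barks   bark    VERB    _   _   0   root    _   _
4   .       .       PUNCT   _   _   3   punct   _   _
```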

brendano commented 6 years ago

Java and MALLET may be interesting. I didn't realize this until recently, but MALLET has its own token-level format it can export (its "save current topic model state" functionality).
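If it helps, that state file is just gzipped plain text with one token per line, so it pulls into R easily. A hedged sketch, assuming the usual `doc source pos typeindex type topic` column layout written by `mallet train-topics --output-state topic-state.gz`:

```r
# Read a MALLET topic-state file into a token-level data frame.
# The leading "#" header lines (column names, alpha, beta) are
# skipped via comment.char.
state <- read.table(
  gzfile("topic-state.gz"),
  header = FALSE, comment.char = "#",
  col.names = c("doc", "source", "pos", "typeindex", "type", "topic"),
  stringsAsFactors = FALSE
)
head(state)
```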

vanatteveldt commented 6 years ago

I have my doubts about standardizing on a specific interchange format, as it quickly leads to different groups standardizing on different formats, and often tool X will not work with format Y, or format Z won't have feature F that someone needs. Even for R document-term matrices (DTMs) there are multiple competing standards, all of which I'm sure have their benefits (two of them are sketched below).

But I'd love to talk about how to ensure that we can all use each other's software and how to make sure we don't invent the same wheel too many times.
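On the competing DTM standards point above, here is a hedged sketch of the same toy matrix in two of them (tm/slam's triplet format and Matrix's dgCMatrix), plus the bridge between them:

```r
library(slam)    # tm's underlying sparse triplet format
library(Matrix)  # the dgCMatrix format used by other packages

# The same toy dtm as a slam simple_triplet_matrix...
stm_dtm <- simple_triplet_matrix(
  i = c(1, 1, 2), j = c(1, 2, 2), v = c(2, 1, 3),
  nrow = 2, ncol = 2,
  dimnames = list(Docs = c("doc1", "doc2"), Terms = c("format", "wheel"))
)

# ...bridged to a Matrix dgCMatrix without densifying.
dgc_dtm <- sparseMatrix(
  i = stm_dtm$i, j = stm_dtm$j, x = stm_dtm$v,
  dims = dim(stm_dtm), dimnames = dimnames(stm_dtm)
)
```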

brendano commented 6 years ago

A shared corpus data format is kind of a big problem -- there's just no standard at all. If you want to import into various tools, or if you're writing a new tool, it's not clear what format to support. It seems like some people would be happy with a solution just within the R world; I'm curious how that has worked so far.

I have this little JSONL format I keep using for many things, but I have no idea if it's really a general-purpose solution.
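I can't speak for the exact schema, but as a purely hypothetical illustration of the JSONL-corpus idea (one JSON object per line, with made-up doc_id/text field names), here is how such a file round-trips into R via jsonlite:

```r
library(jsonlite)

# Write a toy JSONL corpus: one JSON object per line.
# The field names are assumptions, not brendano's actual schema.
writeLines(c(
  '{"doc_id": "doc1", "text": "Interoperability matters."}',
  '{"doc_id": "doc2", "text": "So do shared formats."}'
), "corpus.jsonl")

# stream_in() reads line-delimited JSON into a data frame.
docs <- stream_in(file("corpus.jsonl"))
```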

kyunghyuncho commented 6 years ago

I will attend.

Re interoperability: I believe there are many different aspects/levels to this: (1) data, (2) model specification/implementation, (3) learning vs. inference, and so on. For instance, I may want to use a Perl script (hypothetically, although I rarely do so) for data preparation, use PyTorch to specify, implement, and train a model on this prepared data, and then load it up in R for analysis.
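As a hedged sketch of the R end of such a pipeline, assuming the PyTorch script has written per-document predictions to a hypothetical predictions.csv with doc_id and score columns, the analysis hand-off could be as plain as:

```r
# Load predictions exported by the (hypothetical) PyTorch training
# script, then analyse them in R.
preds <- read.csv("predictions.csv", stringsAsFactors = FALSE)
summary(preds$score)
hist(preds$score, main = "Model scores by document")
```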