renatopp / liac-arff

A library for read and write ARFF files in Python
MIT License

Future of liac-arff #120

Open mfeurer opened 3 years ago

mfeurer commented 3 years ago

CC @jnothman @ogrisel @renatopp @pgijsbers

Dear all,

I'm opening this issue to start a discussion on the future of liac-arff, as the reason I started working on this project will soon go away. I needed an ARFF parser to communicate with OpenML.org and started working on liac-arff because it was the best ARFF parser around (and probably still is). As OpenML will support formats other than ARFF (1,2,3), we will drop ARFF support in the OpenML-Python API, which removes my motivation to maintain this package.

I was therefore wondering how to continue with the package and see the following ways forward:

  1. Moving the work into scikit-learn. scikit-learn will be the main power user of liac-arff once OpenML-Python drops ARFF support, @jnothman made the most recent contributions, and scikit-learn also implements other data readers, so this looks like the best fit to me.
  2. Someone else takes over the package and @renatopp gives access to that person. However, I'm not sure if he still receives these emails and so I'm not sure if that's a possible way forward.
  3. Moving the work into scipy. The liac-arff parser is more feature-complete than the one in scipy and, according to the benchmarks from @jnothman, just as fast.
  4. Creating a new GitHub organization.

Looking forward to your opinion, Matthias

jnothman commented 3 years ago

Hi @mfeurer,

Scikit-learn will move to Parquet when supported by OpenML, as far as I'm concerned.

I guess you're suggesting that we could have this as a dataset loader regardless of OpenML support. Several of our decisions about how to turn ARFF into a scikit-learn-friendly dataset are oriented to how OpenML represents categoricals, booleans, etc. So I'm not immediately convinced that we want a decontextualized ARFF reader in scikit-learn.

I am not convinced that it's worth maintaining in Scipy as a fading technology. If I were the scipy maintainers I'd question its inclusion.

I think creating a new org provides the best opportunity for ongoing maintenance, and I'm okay to manage some of it.

Joel

renatopp commented 3 years ago

Hey! I don't read the discussions here unless my name is invoked :)

I started this reader because, at the time, ARFF and WEKA were great references for ML in the academic context, and, thus, at my lab, we used them a lot. I even had a repository of lots of ARFF datasets that were popular at the time.

It's been ages since I worked with anything related to ML, so I'm completely out of touch with the state of the art in the area, but it's been almost 10 years since the initial release of this package, and I guess that a lot of things have changed since then. So my main questions are:

ARFF has always been fine for toy problems and small datasets, but I would guess that it is completely outdated for newer and larger datasets. Does it make sense to keep this reader alive? Wouldn't it be better to archive it?

Btw, you guys did a great job maintaining this project, you deserve all the kudos. At the time this wasn't possible, but now I can transfer the ownership of the repository if anyone is interested.

mfeurer commented 3 years ago

> Scikit-learn will move to Parquet when supported by OpenML, as far as I'm concerned.

Ah, I didn't know that, then there's no reason to move liac-arff into scikit-learn.

> Does anyone still really use ARFF?

AFAIK it's used by OpenML and WEKA, and a brief Google search showed no other power user. If OpenML moves to Parquet, the main reason to have an ARFF parser in Python will go away, as datasets will be loadable with pandas. However, I just looked at the download stats and they're pretty huge: https://pypistats.org/packages/liac-arff

> As a standard, is it still relevant?

Good question, I don't know. I guess Arrow and Parquet are good general-purpose replacements?

> Does it make sense to keep this reader alive?

For OpenML-Python we probably need it for another six months or so. I'm not sure about scikit-learn, as their deprecation cycle is usually a bit longer.

> Btw, you guys did a great job maintaining this project, you deserve all the kudos.

Thanks :)

> At the time this wasn't possible, but now I can transfer the ownership of the repository if anyone is interested.

I guess this is the best short-term solution then. How about we create a new org called liac-arff with @renatopp, @jnothman and me as admins, so that we preserve the current status quo and can move forward? You should then probably also grant @jnothman and me admin access on PyPI.

jnothman commented 3 years ago

The high download count probably reflects ongoing use, but not a need for new features. Security fixes and perhaps compatibility fixes are really all that's needed for it to remain usable for current purposes in existing systems.

PGijsbers commented 3 years ago

For some context, the peaks in liac-arff interest seem pretty well explained by openml downloads, but I don't know about the baseline (~5k/day):

[two images omitted]

Because I am mentioned, I just wanted to say I am not interested in maintaining this package (but thanks to everyone who did/does!).