piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.63k stars 4.38k forks source link

standardize 'corpus_iterable' (over 'sentences') everywhere #3152

Open gojomo opened 3 years ago

gojomo commented 3 years ago

I'd thought that after our discussion on the merits of changing sentences to corpus_iterable during other renaming/refactoring, I'd changed it everywhere. But after this SO question – https://stackoverflow.com/q/67573416/130288 – I see that sentences persists a bunch of places, like the Word2Vec/FastText initializers & some of the doc-comment examples, including the FastText doc-comment – but is not supported elsewhere, such as the build_vocab()/train() methods specifically shown as using it in the doc-comments.

To TLDR that discussion: sentences often confuses users into thinking that they must run a sentences-splitter, as if the algorithms only work on true 'sentences'. It also seems slightly more likely for people to assume that any string-containing-a-sentence is ok, rather than a list-of-tokens. It also is somewhat in-apt with regard to Doc2Vec, or the common use of these algorithms on data that isn't literally natural-language sentences. Using corpus_iterable instead helps re-emphasize that the 'iterable' interface is all-important – a source of many usage erros – and has a useful parallelism to the corpus_file alternative, as only one or the other of corpus_iterable or corpus_file should be provided. (To the extent it then requires someone to look at the doc-comment, rather than assume the type of its individual items, it might also provide a better hook for communicating the requirement that each item be a list-of-tokens.)

I think for consitency all use of sentences should be eliminated, in method signatures & docs, in favor of corpus_iterable.

piskvorky commented 3 years ago

Pity this didn't get in for Gensim 4.0.0. Now we're stuck supporting both, even if we deprecate sentences.

My preference would be nr. 1 corpus (simple, standard), nr. 2 corpus_iterable (more descriptive), nr. 3 sentences (status quo, not consistent / misleading).

mpenkov commented 3 years ago

Pity this didn't get in for Gensim 4.0.0.

There's always 5.0 ;)

gojomo commented 3 years ago

Pity this didn't get in for Gensim 4.0.0.

There's always 5.0 ;)

Yes, incrementing release numbers are free, and should be done for any convenience.

But also: Gensim hasn't historically been completely-rigorous in its semantic-versioning, and it was definitely the intent to remove sentences for 4.0.0 - so its persistence is a bug, not a promise to leave a mistake in indefinitely. So I'd think it OK to correct such oversights in 4.0.2 or 4.1.0.

piskvorky commented 3 years ago

Hmm, how did we all miss this? We had several review cycles, test releases, how come no one spotted this :( In Word2Vec and FastText no less, Gensim's most exposed classes.

Anyway, it does look like a bug. We want corpus_iterable consistently. Or better, corpus (I'm not a fan of variables like count_integer or name_string = putting the type in the name). But I'm always loath to break compatibility, would prefer deprecation… WDYT @mpenkov?

mpenkov commented 3 years ago

Personally, I have no objections about breaking backwards compatibility, if we do it responsibly - using semantic versioning. My preference is for 5.0 over 4.1, because this would be a breaking change, and could annoy people who have already put in the effort to migrate from 3.x to 4.x.

So I'd think it OK to correct such oversights in 4.0.2 or 4.1.0.

If we spotted this shortly after releasing 4.0, then it wouldn't be a problem IMHO. Right now, however, the situation is different. From our POV, it's an oversight, but from the users' POV, it's a breaking change.

gojomo commented 3 years ago

These hypothetical "people who have already put in the effort to migrate from 3.x to 4.x", in a way that's also relying on this parameter name, could be zero in number.

And, if there are any such people, this is the most simple thing to correct: you get an error the 1st time your run under the old name, you look in docs/release-notes and see "use NEWNAME instead of OLDNAME", change the code.

I think one lesson from this & some of the other post-release bugs is that not that many people are stressing these code paths, or racing to update working things, or formally working through the migration guide.

And, even though (as above) I generally think we can and should be liberal with escalating version numbers, a "5.0" so fast after "4.0" would tend to make a bunch of ink-barely-dry docs/commentary mentioning "4.0" (including offsite in places like StackOverflow, and hyperlinks like to the 'migration guide') prematurely confusing, and in need of confusing updates/further-explanation. I see that as a real cost, as opposed to the "someone might be relying on this" possibly-zero cost.

Gensim's compatibility page notes: "Gensim follows "semantic versioning": major.minor.bugfix, though not fanatically. Always read the release notes carefully!"

Jeff-Winchell commented 11 months ago

The entire data science field is littered with HIGHLY USED products that change APIs frequently and have constant build config issues because libraries over-rely on too many other libraries that constantly change APIs and do far too little automated testing. Software Engineering in Data Science sounds like an oxymoron.

That being said, users have no other options. So we just grin and bear it and I report bugs so I do my part towards incremental progress.

gojomo commented 11 months ago

Yes, per my comment on #3501, I'd rather not break anything working:

PRs that ensure the preferred, explicit 'corpus_iterable' & 'corpus_file' are used everywhere (while the 'sentences' doesn't break where it was working in any 4.x) – & docs match, showing 'sentences' as deprecated – would make sense to me.

Regarding @piskvorky's pref for corpus over corpus_iterable, that'd be OK with me, too. Python doesn't lean too much into type-hinting in variable names.

But, corpus_iterable offers both contrastive parallelism with the exclusive-alternative corpus_file, and directly nudges against some highly-recurring user misunderstandings, including:

As an aside, though the performance of corpus_file in multicore situations remains nice, I still dont like some things about it, like:

Aside:

Though I'd be unlikely to ever have time to do it, I sometimes wonder about the potential benefits fresh reimplementation of this functionality in either a streamlined but more-customizable fully-Pythonic fashion. It'd probably relying either on (a) Numba for bulk/multithreaded benefits in lieu of either Cython or C/C++ ; or (b) some new Python-offshoot hotness like ML-optimized 'Mojo'. I believe either could reach or exceed the performance of the current corpus_file code, while being far easier to read, customize, & maintain/enhance.

And, if a single iterable-corpus codepath nears parallelism-competitiveness with the corpus_file special-casing, for example by writing an abstracted version of the corpus (just vocab-indexes) to memory-mappable files during an early pass (vocab-scan or 1st training-pass), then only a single corpus parameter, with a type specified in the doc-comment, would be plenty.

(Maybe some future CodeGPT-7 will find this comment & grind these concepts out during some otherwise idle GPU time. :)