piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.55k stars 4.37k forks source link

Add more data sanity checks #74

Open piskvorky opened 12 years ago

piskvorky commented 12 years ago

There's been a steady trickle of reports that LSI/LDA misbehave, produce degenerate models, crash Python etc.

Typically this is a user data problem (bad input data, feature id mismatch, ...), but since gensim targets the wide general public, this is gensim's "fault" anyway.

Create utility functions that perform basic sanity checks on user's input data:

  1. check that the all feature ids in a corpus are compatible with the user-provided dictionary (should avoid issues like http://projects.scipy.org/scipy/ticket/1582 )
  2. check that the data range is valid -- look for NaNs, Infs, explicit zeros => these are all illegal in gensim input.
  3. check that the data is not degenerate => all vectors identical/empty/?/model looks weird
  4. ~check corpus type and warn the user if it's plain list (promote the memory-friendly generator interface, shown in tutorials)~ NOT NEEDED
prakhar2b commented 7 years ago

@tmylk I would like to work on this. @piskvorky the link above is broken, could you brief me what it was about ?

piskvorky commented 7 years ago

@prakhar2b I think it was a scipy crash (segfault) when using sparse arrays and indexing an element out-of-bounds.

I wouldn't say this issues is "easy" -- it will need some careful thinking and planning. We definitely don't want to slow down processing too much, by (for example) requiring an extra data pass just to check for bad values.

rasto2211 commented 7 years ago

I would like to work on this issue. Could you please give me some pointers to the code where to start?

tmylk commented 7 years ago

@rasto2211 Adding a warning to LdaModel.init when the input is a list is a good way to start ( item 4 above)

rasto2211 commented 7 years ago

@piskvorky @menshikh-iv Do you also want to close this issue since you closed my PR without merging?

menshikh-iv commented 7 years ago

@rasto2211 No, because the remaining points are important (see Radim comment)

menshikh-iv commented 6 years ago

Be attentive with #1732, I already see exactly same problem twice