ispras / atr4s

Toolkit with state-of-the-art Automatic Terms Recognition methods in Scala
Apache License 2.0

Incremental loading support #6

Closed: enkiv2 closed this issue 6 years ago

enkiv2 commented 6 years ago

I'm planning to use ATR4S with a very large corpus (tens of millions of documents), wherein new documents are added & old ones are updated on a weekly basis. Since it's not feasible to process the entire corpus at once upon each update, I would like to incrementally add documents to a cached preprocessor and only perform the processing necessary to update statistics, then compute weighted terms for only particular documents.

Is this possible using the current implementations of CachedNLPProcessor & CachedFeatureComputer, or do I need to write another wrapper around these classes to interface with Cacher differently? (I suspect the latter, but I'm not very familiar with Scala, so I might be missing something.)

astrakhantsev commented 6 years ago

Unfortunately, the current implementations don't support this functionality; however, it shouldn't be hard to add. I'd suggest creating an IncrementalCacher with an abstract method reduce that takes the cached block (e.g. all cached documents or term candidates) and the newly processed block, reduces them into one block, and then caches the result for the next iterations.

For documents, reduce would be a trivial concatenation; for term candidates, it would be a merge of two maps (given that the set of term candidates is essentially a map from canonical representation to all occurrences).
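A minimal sketch of that idea follows, assuming an in-memory cache. All names here (`IncrementalCacher`, `update`, `Doc`, `Occurrence`, `DocumentCacher`, `CandidateCacher`) are hypothetical and are not part of the existing ATR4S API; a real version would presumably delegate storage to the existing Cacher rather than to a variable.

```scala
// Hypothetical sketch: none of these types exist in ATR4S.

// A cacher that folds each newly processed block into the cached one.
trait IncrementalCacher[B] {
  // Merge the previously cached block with a newly processed block.
  def reduce(cached: B, fresh: B): B

  // In-memory stand-in for persistent storage (an assumption of this sketch).
  private var cache: Option[B] = None

  // Merge `fresh` into the cache and return the merged block.
  def update(fresh: B): B = {
    val merged = cache.fold(fresh)(reduce(_, fresh))
    cache = Some(merged)
    merged
  }
}

// Simplified stand-ins for preprocessed documents and term occurrences.
final case class Doc(id: String, text: String)
final case class Occurrence(docId: String, position: Int)

// Documents: reduce is plain concatenation.
class DocumentCacher extends IncrementalCacher[Vector[Doc]] {
  def reduce(cached: Vector[Doc], fresh: Vector[Doc]): Vector[Doc] =
    cached ++ fresh
}

// Term candidates: merge two maps keyed by canonical representation,
// appending the new occurrences to any existing ones.
class CandidateCacher
    extends IncrementalCacher[Map[String, Vector[Occurrence]]] {
  def reduce(
      cached: Map[String, Vector[Occurrence]],
      fresh: Map[String, Vector[Occurrence]]
  ): Map[String, Vector[Occurrence]] =
    fresh.foldLeft(cached) { case (acc, (term, occs)) =>
      acc.updated(term, acc.getOrElse(term, Vector.empty[Occurrence]) ++ occs)
    }
}
```

Calling `update` once per weekly batch would keep the merged block current. Note that plain concatenation does not handle documents that were updated rather than added; replacing their old entries and occurrences would need extra logic that this sketch leaves out.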

As for features, I'd suggest first trying simple recomputation, because most of them either take much less time than preprocessing/candidates collection (see table 6) or spend most of their time on initialization. Maybe you'll find that 'light' features work best for your data.

enkiv2 commented 6 years ago

I'll look into doing that, thanks!