rug-compling / alpinocorpus

Library for handling Alpino corpora
GNU Lesser General Public License v2.1

Lazy opening of corpora #29

Closed jelmervdl closed 11 years ago

jelmervdl commented 12 years ago

For the DirectoryCorpusReader this would be implementable.

Currently, Dact tries to open each argument supplied through ac::CorpusReaderFactory::open and adds each CorpusReader instance to a MultiCorpusReader.

One way to implement lazy opening of corpora would be to extend the MultiCorpusReader with an additional constructor. One could supply the paths instead of the corpora and the reader would open and close the corpora when needed. This would change the public interface of MultiCorpusReader but is probably the easiest way to implement it.
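
A minimal sketch of that first option, under stated assumptions: `Reader`, `Opener`, `LazyMultiReader` and `FakeReader` are hypothetical stand-ins made up for illustration, not the actual alpinocorpus classes. The reader stores only paths and holds one corpus open at a time while iterating.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ac::CorpusReader; not the real interface.
struct Reader {
    virtual ~Reader() = default;
    virtual std::vector<std::string> entries() const = 0;
};

// Hypothetical opener, playing the role of ac::CorpusReaderFactory::open.
using Opener = std::function<std::unique_ptr<Reader>(std::string const &)>;

// Stores paths only. Nothing is opened by the constructor; each corpus is
// opened on demand and closed again before the next one is touched.
class LazyMultiReader {
public:
    LazyMultiReader(std::vector<std::string> paths, Opener open)
        : paths_(std::move(paths)), open_(std::move(open)) {}

    // Visit every entry of every corpus, holding one corpus open at a time.
    void forEachEntry(std::function<void(std::string const &)> const &visit) const {
        for (auto const &path : paths_) {
            std::unique_ptr<Reader> reader = open_(path); // corpus opened here
            assert(reader); // assumption: the opener throws instead of returning null
            for (auto const &entry : reader->entries())
                visit(path + "/" + entry);
        } // reader goes out of scope after each iteration, closing that corpus
    }

private:
    std::vector<std::string> paths_;
    Opener open_;
};

// Toy reader for demonstration: tracks how many instances are alive at once.
struct FakeReader : Reader {
    static int alive;
    static int maxAlive;
    FakeReader() { ++alive; if (alive > maxAlive) maxAlive = alive; }
    ~FakeReader() override { --alive; }
    std::vector<std::string> entries() const override { return {"1.xml", "2.xml"}; }
};
int FakeReader::alive = 0;
int FakeReader::maxAlive = 0;
```

With this shape, construction stays cheap no matter how many paths are supplied, but an unreadable corpus is only noticed once iteration reaches it.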

Another approach would be to add methods to the CorpusReader interface that signal a reader is no longer actively used. The MultiCorpusReader could then tell the previous CorpusReader that it won't be queried for some time, and start querying the next one. The public interface would remain the same, but the behavior of CorpusReader instances would change slightly: corpora would no longer be opened as soon as a CorpusReader instance is created. As a result, the error checking and even the try-catch statements of ac::CorpusReaderFactory::open would no longer serve their purpose.
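
To make the drawback concrete, here is a rough sketch of a reader that opens on first use and can be told it is idle. `DeferredReader` and its `release()` method are hypothetical names for illustration only, and an empty path stands in for a corpus that cannot be opened.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical reader that defers opening until the first query.
class DeferredReader {
public:
    // Construction never fails: nothing is opened yet.
    explicit DeferredReader(std::string path) : path_(std::move(path)) {}

    // First query triggers the open, so a bad corpus is only detected here.
    std::size_t size() {
        ensureOpen();
        return 2; // placeholder entry count for the sketch
    }

    // Hypothetical "no longer actively used" signal: drop the open corpus.
    void release() { open_ = false; }

    bool isOpen() const { return open_; }

private:
    void ensureOpen() {
        if (open_)
            return;
        // Assumption for the sketch: an empty path models an unreadable corpus.
        if (path_.empty())
            throw std::runtime_error("cannot open corpus");
        open_ = true;
    }

    std::string path_;
    bool open_ = false;
};
```

Constructing a `DeferredReader` with a bad path succeeds, and the exception only appears on the first `size()` call, which is why a try-catch around the factory call would no longer catch open failures.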

danieldk commented 12 years ago

I am all in favor of the first approach. When you open a directory with hundreds of .dact corpora, a lot of time is spent opening them while the user expects immediate feedback. We can break the interface for this in the next major version.

I have some other ideas that I'd like to try, that would also break the interface ;).

danieldk commented 12 years ago

We now have lazy Multi/RecursiveCorpusReaders.

Todo: the first corpus is now opened for query validation. Maybe we want to keep a corpus open for validation? Or follow a different approach completely? (Since the first corpus may support a different subset than other corpora...)