rug-compling / alpinocorpus

Library for handling Alpino corpora
GNU Lesser General Public License v2.1

Cannot cancel opening of corpus #2

Closed jelmervdl closed 12 years ago

jelmervdl commented 13 years ago

Since opening a file happens at the construction of the object, we cannot tell the object to stop opening a file (or query its progress). For the DirectoryCorpusReader this might be really useful. As far as I understand, it isn't possible to implement this in the DbCorpusReader because dbxml does not support it.

Todo: determine the priority of this feature

The problem: we cannot cancel the open-file operation, which can take a long time for big corpora.

Possible solutions:

  1. Kill the thread that opens a corpus when the user presses the cancel button. But this implies an unclean shutdown, possibly corrupting indexes and leaving file handles open.
  2. Keep the thread running, but ignore all its signals. This will allow the thread to stop cleanly, but doesn't free up the resources used while opening the corpus. Also I think this is harder to implement in Dact.
  3. Lazy loading. On construction of the corpus reader, don't start loading right away, but delay this until an iterator is requested. This way we have an object which we can tell to stop (using a close() method or something).
  4. Separate open() from the constructor, add a close() method to the reader which can be called while open() is busy to cancel open() or any other action and invalidate the object.
  5. Some sort of progress indicator which we pass to the constructor. The constructor queries this indicator while opening the corpus. From Dact, we can set the indicator's status to 'stop'; the reader sees this when it next queries the indicator and stops reading. But this feels like a hack and leaves us with an uninitialized object.

I think 3 or 4 are the best options. I think I would prefer 4 above 3 because it is less magical.
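To make option 4 a bit more concrete, here is a rough sketch of what I have in mind. It is not the alpinocorpus API: `CancellableReader` and all of its members are hypothetical names, and the per-file work is replaced by a stand-in. The point is only that open() polls a flag which close() can set from another thread (e.g. from Dact's cancel button).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader, NOT the real alpinocorpus interface.
// open() is separate from the constructor and can be cancelled via close().
class CancellableReader {
public:
    explicit CancellableReader(std::vector<std::string> files)
        : files_(std::move(files)), cancelled_(false), open_(false) {}

    // Runs in the worker thread. Returns true if every file was processed,
    // false if close() was called before or during the loop.
    bool open() {
        assert(!open_.load());  // opening twice is not supported in this sketch
        entries_.clear();
        for (const std::string &file : files_) {
            if (cancelled_.load()) {
                entries_.clear();  // leave no half-built index behind
                return false;
            }
            entries_.push_back(file);  // stand-in for the real per-file work
        }
        open_.store(true);
        return true;
    }

    // May be called from another thread while open() is busy.
    void close() { cancelled_.store(true); }

    bool isOpen() const { return open_.load(); }

    // Only meaningful once open() has returned.
    std::size_t size() const { return entries_.size(); }

private:
    std::vector<std::string> files_;
    std::vector<std::string> entries_;
    std::atomic<bool> cancelled_;
    std::atomic<bool> open_;
};
```

The worker notices the cancellation at the next file boundary, so the thread still stops cleanly, unlike option 1. The cost is that the object can exist in a "not opened" state, which callers have to check for.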

larsmans commented 13 years ago

I concur that (1), (2) and (5) aren't proper solutions.

(3) is possible. The only issue is that it makes DirectoryCorpusReader::size() an O(n) operation, unless we store the number of entries somewhere on disk (changing the format). I'll start working on this.
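A minimal sketch of the trade-off, under assumptions: `LazyDirectoryReader` and its `Lister` callback are hypothetical names made up for illustration (the callback stands in for walking the directory), not the code that will end up in the library. Construction does no work; the first call to size() or entries() pays for the full scan, and the result is cached afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lazy reader, NOT the real DirectoryCorpusReader.
class LazyDirectoryReader {
public:
    // Stand-in for the directory walk: returns the entry names.
    using Lister = std::function<std::vector<std::string>()>;

    // Cheap constructor: stores the lister, scans nothing.
    explicit LazyDirectoryReader(Lister lister)
        : lister_(std::move(lister)), scanned_(false) {
        assert(static_cast<bool>(lister_));  // an empty callback is a caller bug
    }

    bool scanned() const { return scanned_; }

    // O(n) on the first call because it forces the scan; O(1) afterwards.
    std::size_t size() {
        ensureScanned();
        return entries_.size();
    }

    const std::vector<std::string> &entries() {
        ensureScanned();
        return entries_;
    }

private:
    void ensureScanned() {
        if (!scanned_) {
            entries_ = lister_();
            scanned_ = true;
        }
    }

    Lister lister_;
    std::vector<std::string> entries_;
    bool scanned_;
};
```

Storing the entry count on disk would make the first size() call cheap as well, at the price of the format change mentioned above.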

I oppose (4) because it violates the RAII principle and complicates the CorpusReader logic greatly because it introduces signals into the backend library.

larsmans commented 13 years ago

There's a new branch called lazydirs that implements option 3. I can successfully open and view a directory treebank with it.