htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files
37 stars 12 forks source link

feature request: __getitem__ (or similar) for page ranges #8

Closed senderle closed 4 years ago

senderle commented 7 years ago

I've been tinkering around with the feature reader -- it's very pleasant to use. One thing I miss is the ability to access individual pages by index. I'd be happy to write something but I thought I'd make it an official issue for the purpose of discussion.

I imagine the reason this doesn't yet exist is that it's not 100% clear what the best API is -- am I correct about that? I was about to write up a simple __getitem__-style accessor, when I realized that pages() is a generator, and so there's something potentially inconsistent about using __getitem__ and returning ordinary slices, in which the page objects are all loaded in advance.

It may be that this would add too much complexity for the moment, and I'll just have to settle for isliceing the pages generator. Let me know your thoughts!

organisciak commented 7 years ago

Thanks for bringing this up. I've been giving your suggestion a lot of thought.

I'd love to hear more about your experience? What type of use case do you have for grabbing individual pages? Do you need the Page objects, or are you just interested in a specific feature, like token counts?

As we think about how to implement your request, you can hack around it as follows:

page = Page(vol._pages[n], vol)

I've been trying to move away from overly nested thinking about the design. The library used to have 4 objects, a Volume was a group of Pages was a group of Sections was a group of Tokenlists. It now has only Volumes and Pages, and anything you can get in a Page is accessible in a Volume.

This is why I'd love to hear about your use, and what you need when you say 'page'. It is trivial to prepare something akin to vol.pageslice[0:10] that initializes Page objects on-demand (as above), but I can't decide on whether thinking about 'pages' is again overly granular.

The current best way to get 'pages' is to get the features you want for the volume and filter: e.g. "get these pages of this feature" rather than "get this feature from these pages". For tokens you can do something like

vol.tokenlist().loc[301:305,]

This refers to the page number as saved in the book data, so it isn't zero-indexed like the Page() approach above.

If you only want one page, there's a cost to initializing the full volume token list vs. just a page, but that process has been optimized enough that it is very small.

senderle commented 7 years ago

I see that the tokenlist approach would probably satisfy my needs, and so I might simply be bringing mistaken assumptions to the API (since I'm just getting started).

The request stems from a desire to throw out pages from works of fiction that aren't "body" pages -- in the usual (but slightly ill-defined) sense. I've set up a heuristic to identify pages from works of fiction that are likely to be part of the body of the text. I then want to calculate word counts over n bins of body pages. My first instinct was to slice a sequence of pages but I see there's reason to take a different approach.

That being said, I actually started tinkering around with a LazySequence class that uses the ABC Sequence protocol to behave like a range object with a built in base sequence and map function. It uses weak references to cache map output.

The idea is that you can pass an index or slice to one of these objects, and get either an item (the output, in this case, of Page(self._pages[ix])) or a new LazySequence that behaves exactly as if it had been sliced from a list, but without loading all the pages up front. Each LazySequence comes with its own WeakValueDictionary that holds on to previously calculated values for just as long as they're in use somewhere. If they get garbage-collected, subsequent access to the value will re-create them.

It's possible this is a bit over-engineered. But I'm surprised how easy it has been; I'm just using a bunch of ordinary, built-in features of Python and combining them in straightforward ways.

senderle commented 7 years ago

Following up, two notes:

1) For this to work well, I think we'd want a sensible definition of __eq__ for pages. (To decide the result of __contains__.)

2) My instinct is that "pages" isn't so much a different level of type granularity as it is a composition of Page and Sequence types. Whereas now, the pages attribute of a Volume is the composition of Page and Iterator types.

senderle commented 7 years ago

This was an interesting enough problem that I went ahead and wrote up the code. I wrote a couple of tests for correctness. Not sure about performance!

https://github.com/senderle/lazysequence

organisciak commented 7 years ago

Thank you. I'll take a look on Wed, apologies for the delay.

On Fri, Feb 17, 2017, 6:08 AM Jonathan Scott Enderle < notifications@github.com> wrote:

This was an interesting enough problem that I went ahead and wrote up the code. I wrote a couple of tests for correctness. Not sure about performance!

https://github.com/senderle/lazysequence

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/htrc/htrc-feature-reader/issues/8#issuecomment-280644426, or mute the thread https://github.com/notifications/unsubscribe-auth/AADj7wuOnxyCC40Sl74HTMoq1ul4lR4Kks5rdZu0gaJpZM4MCiMD .