can we build paragraphs and chapters as units into our tool?

milanterlunen commented 8 years ago

so that we can do e.g. most-quoted paragraphs/chapters? the paragraph would be a usefully sized unit (one which moreover has textual reality), sitting in between the unit of quotation and our variously sized "chunks", the smallest of which will be >1000 words. the chapter is also useful as a unit with textual reality.

long-term, if we're imagining this as a general tool, could we build in a system for recognising any and all textual subdivisions (e.g. "part", "book", "volume", "serial instalment", "scholium", "axiom")? potentially would need some user input to specify categories and their hierarchical relation, but presumably easy to automate.
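a rough sketch of what the user-specified hierarchy could look like, in Python. everything here is an assumption for illustration: the `find_divisions` name, the list-of-labels input, and the idea that headings sit on their own line as a label followed by a roman or arabic numeral. real texts will need messier patterns.

```python
import re

def find_divisions(text, levels):
    """Find heading lines for user-specified subdivision labels.

    `levels` lists the labels from outermost to innermost,
    e.g. ["book", "chapter"]. Returns a list of
    (depth, label, heading, char_offset) tuples in document order.
    """
    labels = [label.lower() for label in levels]
    # The label is matched case-insensitively; the numeral is not, so an
    # ordinary word such as "mild" is never mistaken for a roman numeral.
    pattern = re.compile(
        r"^[ \t]*((?i:%s))[ \t]+([IVXLCDM]+|\d+)\b.*$"
        % "|".join(re.escape(label) for label in labels),
        re.MULTILINE,
    )
    divisions = []
    for match in pattern.finditer(text):
        label = match.group(1).lower()
        divisions.append(
            (labels.index(label), label, match.group(0).strip(), match.start())
        )
    return divisions
```

calling `find_divisions(text, ["book", "chapter"])` would give each heading with its depth in the hierarchy, so quotation offsets could later be assigned to the nearest enclosing heading at each level.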

milanterlunen commented 8 years ago

come to think of it, this becomes especially crucial if we want to deal with a Sacred Corpus consisting of many texts (e.g. if studying a corpus of short poems, or all the books by e.g. Foucault).

JonathanReeve commented 8 years ago

Let's definitely do this. This summer, I wrote chapterize, which has been successfully tested on Middlemarch, so I already have the chapters divided. (It doesn't handle the preface and epilogue yet, though, so we'll have to add those manually.)

Paragraphs shouldn't be too hard, but I don't immediately know of a way to do that. I just found this StackOverflow answer, though, which seems promising:

http://stackoverflow.com/questions/25072167/split-text-into-paragraphs-nltk-usage-of-nltk-tokenize-texttiling
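As far as I understand it, the TextTiling approach in that answer segments by topic shift rather than by typographic paragraph. If our plain-text edition separates paragraphs with blank lines (an assumption we'd need to check against the Middlemarch file), a much simpler sketch might be enough. The `split_paragraphs` name is hypothetical:

```python
import re

def split_paragraphs(text):
    """Split plain text into paragraphs on one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse hard-wrapped lines inside each paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

Running this on each chapter's text would give paragraph units nested inside the chapter units.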

lit-mod-viz / middlemarch-critical-histories