milanterlunen closed 7 years ago
come to think of it, this becomes especially crucial if we want to deal with a Sacred Corpus consisting of many texts (e.g. a corpus of short poems, or all the books by, say, Foucault).
Let's definitely do this. This summer, I wrote chapterize, which has been successfully tested on Middlemarch, so I already have the chapters divided. (It doesn't handle the preface and epilogue yet, though, so we'll have to add those manually.)
Paragraphs shouldn't be too hard, but I don't immediately know of a way to do that. I just found this StackOverflow answer, though, which seems promising:
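This isn't the linked answer, just a minimal sketch of one way it might work, assuming paragraphs in our plain-text file are separated by one or more blank lines. The name `split_paragraphs` is made up for illustration. It keeps character offsets so paragraphs can later be lined up with match locations.

```python
def split_paragraphs(text):
    """Return (start, end, paragraph) tuples for blank-line-separated paragraphs.

    Assumption: a paragraph is a run of consecutive non-blank lines.
    Offsets are character positions into `text`, with `end` exclusive.
    """
    paragraphs = []
    start = None  # offset where the current paragraph began, if any
    end = 0       # offset just past the last non-blank character seen
    pos = 0       # offset of the start of the current line
    for line in text.splitlines(keepends=True):
        if line.strip():
            if start is None:
                start = pos
            end = pos + len(line.rstrip())
        elif start is not None:
            # A blank line closes the paragraph in progress.
            paragraphs.append((start, end, text[start:end]))
            start = None
        pos += len(line)
    if start is not None:
        # The text ended without a trailing blank line.
        paragraphs.append((start, end, text[start:end]))
    return paragraphs


sample = "First para,\nstill first.\n\n\nSecond para.\n"
for start, end, para in split_paragraphs(sample):
    print(start, end, repr(para))
```

Texts that mark paragraphs by indentation rather than blank lines would need a different rule.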
so that we can do e.g. most-quoted paragraphs/chapters? the paragraph would be a usefully sized unit (which moreover has textual reality), in between the unit of quotation and our variously sized "chunks", the smallest of which will be >1000 words. the chapter is also useful as a unit with textual reality.
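A rough sketch of how "most-quoted paragraphs/chapters" could be tallied, assuming (hypothetically) that both the units and the detected quotations are available as `(start, end)` character offsets into the same text. The function name and the span format are assumptions for illustration, not what the project currently produces.

```python
from bisect import bisect_right
from collections import Counter


def count_quotes_per_unit(unit_spans, quote_spans):
    """Count how many quotations overlap each unit (paragraph or chapter).

    Assumptions: `unit_spans` is sorted and non-overlapping, and every
    span is (start, end) with `end` exclusive. A quotation that straddles
    two units is counted once for each.
    """
    starts = [start for start, _ in unit_spans]
    counts = Counter()
    for q_start, q_end in quote_spans:
        # Last unit that begins at or before the quotation's start.
        i = max(bisect_right(starts, q_start) - 1, 0)
        while i < len(unit_spans) and unit_spans[i][0] < q_end:
            if unit_spans[i][1] > q_start:
                counts[i] += 1
            i += 1
    return counts


paragraph_spans = [(0, 24), (27, 39)]
quotes = [(5, 10), (20, 30), (28, 35), (30, 33)]
counts = count_quotes_per_unit(paragraph_spans, quotes)
print(counts.most_common(1))
```

The same function works for chapters if chapter spans are passed in place of paragraph spans.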
long-term, if we're imagining this as a general tool, could we build in a system for recognising any and all textual subdivisions (e.g. "part", "book", "volume", "serial instalment", "scholium", "axiom")? potentially would need some user input to specify categories and their hierarchical relation, but presumably easy to automate.
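One way the user input might look, as a sketch only: the user lists the division types from largest to smallest, each with a regular expression for its heading line, and the tool works out the nesting. The function name `label_divisions` and the heading patterns are hypothetical; real texts will need their own patterns.

```python
import re


def label_divisions(lines, levels):
    """Find heading lines and record where each sits in the hierarchy.

    `levels` is a user-supplied list of (name, heading_regex) pairs,
    ordered from the largest division to the smallest. Returns a list of
    (line_number, path) pairs, where `path` maps each enclosing division
    name to its heading text.
    """
    compiled = [(name, re.compile(pattern)) for name, pattern in levels]
    current = [None] * len(compiled)  # current heading at each depth
    divisions = []
    for line_no, line in enumerate(lines):
        heading = line.strip()
        for depth, (name, pattern) in enumerate(compiled):
            if pattern.match(heading):
                current[depth] = heading
                # A new larger division closes all smaller ones.
                for deeper in range(depth + 1, len(compiled)):
                    current[deeper] = None
                path = {n: h for (n, _), h in zip(compiled, current) if h is not None}
                divisions.append((line_no, path))
                break
    return divisions


levels = [
    ("book", r"BOOK [IVXLC]+\.?$"),        # hypothetical heading pattern
    ("chapter", r"CHAPTER [IVXLC]+\.?$"),  # hypothetical heading pattern
]
lines = ["BOOK I", "CHAPTER I", "some text", "CHAPTER II", "BOOK II", "CHAPTER III"]
for line_no, path in label_divisions(lines, levels):
    print(line_no, path)
```

Adding "part", "volume", or "scholium" would then just mean adding another entry to `levels` at the right position.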