jdmartin / eltec-text-splitter

Chunk English Novels Into Chapters
MIT License

TEI header information not included, files not placed in single bucket #4

Open tarahmarie opened 2 years ago

tarahmarie commented 2 years ago

Sample output from philologic load:

`1920-ENG19200—Lawrence-chapter_24 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_25 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_26 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_27 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_28 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_29 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_3 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_30 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_31 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_4 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_5 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_6 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_7 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_8 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_9 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_1 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_2 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_3 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_4 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_5 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_6 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_7 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_8 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_9 has no valid TEI header or contains invalid data: removing from database load...
Sun Aug 28 01:58:26 2022: Parsing document level metadata: 3529/3529 done...
Sun Aug 28 01:58:26 2022: Sorting files by the following metadata fields: year, author, title, filename...

Parsing files

Sun Aug 28 01:58:26 2022: parsing 0 files.
Sun Aug 28 01:58:26 2022: done parsing

Merge parser output

Sun Aug 28 01:58:26 2022: sorting words
Sun Aug 28 01:58:26 2022: Merging words in batches of 1000...
Sun Aug 28 01:58:26 2022: Merging all merged sorted files (this may take a while)...
^Cwords sorting failed
Interrupting database load...`

The files are also put into per-book folders instead of one top-level folder. I've moved them by hand into a single folder (the script I wrote borked a system directory, so I gave up on that and did it by hand). Ideas? Thoughts?

jdmartin commented 2 years ago

Sorry, I was mirroring the structure of the Gutenberg splitter, as I thought that was what was needed. I can change that. Adding TEI headers is easy enough, as long as the minimum (cf. Gutenberg splitter) is all that is required?

I'll give this a look later tonight.

tarahmarie commented 2 years ago

It's both, actually. For splits, they need to be in individual book folders. For the bucket over in sequence alignment, they must be at top level. The splits calculated ngrams and hapaxes gorgeously; it took about 1.5 hours for all 100 books to run. But sequence alignment's been a bit stickier. I ended up borking a serious folder with a misplaced . in a shell script :-) so it took a bit to unbork (which also involved reconfiguring a NAS to permit wireless Time Machine restoration, since that was missing). Anyway, lots of yak shavings on the ground around here.
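The two layouts described above (per-book folders for splits, one flat folder for sequence alignment) can be derived from each other without a shell script. The following is only a minimal sketch: the function name `flatten_into_bucket` and the directory arguments are hypothetical, and it assumes chapter file names are already unique across books, as the names in the log above suggest. It copies rather than moves, so the per-book folders stay intact, and it refuses to run if the bucket would sit inside the tree being walked.

```python
from pathlib import Path
import shutil


def flatten_into_bucket(splits_dir: Path, bucket_dir: Path) -> list[Path]:
    """Copy every file found at splits_dir/<book>/<chapter> into bucket_dir.

    Hypothetical helper, not part of the splitter itself.
    """
    splits_dir = splits_dir.resolve()
    bucket_dir = bucket_dir.resolve()

    # Guard against pointing the bucket at (or inside) the source tree.
    if bucket_dir == splits_dir or splits_dir in bucket_dir.parents:
        raise ValueError("bucket_dir must not be inside splits_dir")

    bucket_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # Exactly one level of nesting: <book folder>/<chapter file>.
    for chapter in sorted(splits_dir.glob("*/*")):
        if not chapter.is_file():
            continue
        target = bucket_dir / chapter.name
        # Assumption: names are unique across books; stop rather than overwrite.
        if target.exists():
            raise FileExistsError(f"name collision in bucket: {target.name}")
        shutil.copy2(chapter, target)
        copied.append(target)
    return copied
```

Called as `flatten_into_bucket(Path("splits"), Path("bucket"))`, it returns the list of files written, which makes it easy to compare the count against the number of chapters expected.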

jdmartin commented 2 years ago

Addressed by #5. The notes for that are:

  • Basic TEI headers are now generated.
  • Folder structure changed.
    • output/splits breaks down by work, with clean chapters enclosed.
    • output/bucket has all books, all chapters in clean form.
    • output/tei_splits breaks down by work, with TEI wrapper around clean text.
    • output/tei_bucket has all books, all chapters with TEI wrapper around clean text.
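As an illustration of the layout listed above, here is a rough sketch of wrapping one clean chapter in a basic TEI header and writing it to both `output/tei_splits/<work>/` and `output/tei_bucket/`. It is not the code from #5. The function names, the `.xml` extension, and the placeholder text in `publicationStmt` and `sourceDesc` are assumptions, and the thread does not say exactly what PhiloLogic accepts as a valid header, so this simply uses the three `fileDesc` children that TEI P5 requires.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

TEI_NS = "http://www.tei-c.org/ns/1.0"


def wrap_in_tei(text: str, title: str, author: str) -> str:
    """Return text wrapped in a TEI document with a minimal header (sketch)."""
    # Serialize TEI as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, content=None):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        node.text = content
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    title_stmt = el(file_desc, "titleStmt")
    el(title_stmt, "title", title)
    el(title_stmt, "author", author)
    # Placeholder statements (assumed wording), present only to satisfy fileDesc.
    el(el(file_desc, "publicationStmt"), "p", "Unpublished working file.")
    el(el(file_desc, "sourceDesc"), "p", "Derived from a split source text.")

    body = el(el(root, "text"), "body")
    # Treat blank-line-separated blocks as paragraphs.
    for para in text.split("\n\n"):
        if para.strip():
            el(body, "p", para.strip())
    return ET.tostring(root, encoding="unicode")


def write_chapter(output_root: Path, work: str, chapter_name: str, tei_xml: str) -> None:
    """Write one wrapped chapter to tei_splits/<work>/ and to tei_bucket/."""
    for folder in (output_root / "tei_splits" / work, output_root / "tei_bucket"):
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"{chapter_name}.xml").write_text(tei_xml, encoding="utf-8")
```

Building the tree with `ElementTree` rather than string formatting means characters such as `&` or `<` in a title or chapter are escaped automatically.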

Due to inconsistencies in the source, the <title> and <author> may contain extra or even incorrect information. If these are required for anything other than unique identifiers, then data cleaning will be necessary.

tarahmarie commented 2 years ago

No errors being thrown as I initiate the philologic DB load; will be several hours or a day til full results in from comparison plus basic stats. Thank you! I will mark issue resolved once sequence alignments generate w/manageable or no errors.