Open tarahmarie opened 2 years ago
Sorry, I was mirroring the structure of the Gutenberg splitter, as I thought this was what was needed. I can change that. Adding TEI headers is easy enough, as long as the minimum (cf. Gutenberg splitter) is all that is required?
I'll give this a look, later tonight.
It's both, actually. For splits, they need to be in individual book folders. For the bucket over in sequence alignment, they must be at the top level. The splits calculated ngrams and hapaxes gorgeously; it took about 1.5 hours for all 100 books to run. But sequence alignment's been a bit stickier. I ended up borking a serious folder with a misplaced . in a shell script :-) so it took a bit to unbork (which also involved reconfiguring a NAS to permit wireless Time Machine restoration, since that was missing). Anyway, lots of yak shavings on the ground around here.
Addressed by #5. The notes for that are:
- Basic TEI headers are now generated.
- Folder structure changed.
- output/splits breaks down by work, with clean chapters enclosed.
- output/bucket has all books, all chapters in clean form.
- output/tei_splits breaks down by work, with TEI wrapper around clean text.
- output/tei_bucket has all books, all chapters with TEI wrapper around clean text.
Due to inconsistencies in the source, the
and may contain extra or even incorrect information. If these are required for anything other than unique identifiers, then data cleaning will be necessary.
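For anyone following along, here is a rough sketch of what "TEI wrapper around clean text" with a basic header could look like. It is not the code from #5 and is not guaranteed to satisfy PhiloLogic's loader; the function name `wrap_in_tei`, its `title` and `author` parameters, and the placeholder publication/source statements are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def wrap_in_tei(text, title, author):
    """Return a TEI document (as a string) with a minimal header around plain text.

    Hypothetical helper: the header fields chosen here are illustrative only.
    """
    # Serialize with TEI as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, content=None):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        node.text = content
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(root, "teiHeader"), "fileDesc")

    title_stmt = el(file_desc, "titleStmt")
    el(title_stmt, "title", title)
    el(title_stmt, "author", author)

    # Placeholder statements (assumed wording, not taken from the project).
    el(el(file_desc, "publicationStmt"), "p", "Unpublished working copy.")
    el(el(file_desc, "sourceDesc"), "p", "Split from a plain-text source.")

    # One <p> per blank-line-separated paragraph of the clean chapter text.
    body = el(el(root, "text"), "body")
    for para in text.split("\n\n"):
        if para.strip():
            el(body, "p", para.strip())

    return ET.tostring(root, encoding="unicode")
```

Building the tree with `ElementTree` rather than string templates means characters such as `&` and `<` in the chapter text are escaped automatically, which is one common way for a wrapped file to end up as invalid XML.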
No errors being thrown as I initiate the philologic DB load; it will be several hours or a day until full results are in from the comparison plus basic stats. Thank you! I will mark the issue resolved once sequence alignments generate with manageable or no errors.
Sample output from philologic load:
```
1920-ENG19200—Lawrence-chapter_24 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_25 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_26 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_27 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_28 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_29 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_3 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_30 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_31 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_4 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_5 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_6 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_7 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_8 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19200—Lawrence-chapter_9 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_1 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_2 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_3 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_4 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_5 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_6 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_7 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_8 has no valid TEI header or contains invalid data: removing from database load...
1920-ENG19201—Arlen-chapter_9 has no valid TEI header or contains invalid data: removing from database load...
Sun Aug 28 01:58:26 2022: Parsing document level metadata: 3529/3529 done...
Sun Aug 28 01:58:26 2022: Sorting files by the following metadata fields: year, author, title, filename...
Parsing files
Sun Aug 28 01:58:26 2022: parsing 0 files.
Sun Aug 28 01:58:26 2022: done parsing
Merge parser output
Sun Aug 28 01:58:26 2022: sorting words
Sun Aug 28 01:58:26 2022: Merging words in batches of 1000...
Sun Aug 28 01:58:26 2022: Merging all merged sorted files (this may take a while)...
^Cwords sorting failed
Interrupting database load...
```
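Since the load ends up parsing 0 files, it may help to check the inputs before handing them to the loader. Below is a small pre-flight sketch that lists files which either fail to parse as XML or contain no `teiHeader` element. This is only a guess at what "no valid TEI header or contains invalid data" covers; PhiloLogic's actual validity rules may be stricter. The helper names `has_tei_header` and `report` are made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def has_tei_header(path):
    """True if the file parses as XML and contains a teiHeader element."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        # Not well-formed XML (plain text, empty file, unescaped '&', ...).
        return False
    # Compare local names so the check works with or without the TEI namespace.
    return any(el.tag.rsplit("}", 1)[-1] == "teiHeader" for el in root.iter())


def report(folder):
    """Return the sorted names of files under folder that fail the check."""
    return sorted(
        p.name
        for p in Path(folder).rglob("*")
        if p.is_file() and not has_tei_header(p)
    )
```

Running `report(...)` on the folder being loaded and comparing the result with the names in the log would show whether the rejected chapters are the plain-text versions or TEI files that are malformed in some way.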
And files are put into book folders instead of one top-level folder. I've put them by hand into a single folder (the script I wrote borked a system directory, so I quit that and just did it by hand). Ideas? Thoughts?
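One possible way to do the flattening without a shell script: a Python sketch that copies (never moves or deletes) every file from the per-book folders into a single folder, and defaults to a dry run so the plan can be inspected first. The function name `flatten` and its parameters are hypothetical, and it assumes chapter file names are unique across books, as they appear to be in the log above.

```python
import shutil
from pathlib import Path


def flatten(src_root, dest, dry_run=True):
    """Plan (and optionally perform) a flat copy of all files under src_root.

    Returns a list of (source, target) pairs. Nothing is written when
    dry_run is True, and sources are never modified.
    """
    src_root, dest = Path(src_root).resolve(), Path(dest).resolve()
    # Refuse to write into the tree being read, to avoid copying copies.
    if dest == src_root or src_root in dest.parents:
        raise ValueError("dest must be outside src_root")

    planned, seen = [], set()
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        target = dest / src.name
        # Stop rather than overwrite if two books share a file name.
        if target in seen or target.exists():
            raise FileExistsError(f"name collision: {src.name}")
        seen.add(target)
        planned.append((src, target))

    if not dry_run:
        dest.mkdir(parents=True, exist_ok=True)
        for src, target in planned:
            shutil.copy2(src, target)
    return planned
```

Calling `flatten("output/tei_splits", "flat_bucket")` would only return the plan; passing `dry_run=False` performs the copy. The `flat_bucket` destination is an example name, not a path from the project.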