Open pickettj opened 3 years ago
Here's what we got. For the loading and parsing, I might just go with a list of lines initially? Here's my implementation that makes use of the walrus operator:
from bs4 import BeautifulSoup

pahlavi_corpus = {}
for name, src in pah_xml_corpus.items():
    tree = BeautifulSoup(src)
    paras = tree.find_all("w:p")
    # keep only paragraphs whose text is non-empty
    document = [t for p in paras if len(t := p.get_text()) > 0]
    pahlavi_corpus[name] = document
If you're feeling feisty, you can even implement it in "one" line:
pahlavi_corpus = {
    name: [
        t for p in BeautifulSoup(src).find_all("w:p") if len(t := p.get_text()) > 0
    ] for name, src in pah_xml_corpus.items()
}
Things got a little wild with the line parser, but here's my best shot:
import re

# group 1: optional numbering tag ending in .N, .N.N or .N.N.N; group 2: rest of the line
num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')
pahlavi_corpus_lines = {}
for name, doc in pahlavi_corpus.items():
    match = [ret.groups() for text in doc if (ret := num_pattern.match(text)) is not None]
    if all(num is None for num, text in match):
        # this doc doesn't use line numbers, make our own
        pahlavi_corpus_lines[name] = {str(i+1): text for i, (_, text) in enumerate(match)}
        continue
    segment = {}
    para, line = None, None
    # the ('-end-', '') sentinel flushes the last segment
    for i, (num, text) in enumerate(match+[('-end-', '')]):
        if num is not None:
            if i > 0:
                # text before the first numbering tag is stored under '-start-'
                store = '-start-' if para is None else para
                segment[store] = line
            para, line = num, None
        if line is None:
            line = text
        else:
            line += '\n' + text
    pahlavi_corpus_lines[name] = segment
This tries to detect if there are no line numbering tags in the document. You could also set some kind of cutoff here.
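One way the cutoff idea could look, as a rough sketch rather than anything from the notebook: count how many lines carry a numbering tag and only treat the document as numbered if that share clears a threshold. The helper name `uses_line_numbers` and the default `cutoff=0.1` are made up for illustration; the regex is the same `num_pattern` as above.

```python
import re

num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')

def uses_line_numbers(doc, cutoff=0.1):
    """Return True if at least `cutoff` of the lines start with a numbering tag.

    `doc` is a list of line strings; `cutoff` is a hypothetical threshold
    between 0 and 1 that would need tuning against the real corpus.
    """
    if not doc:
        return False
    numbered = 0
    for text in doc:
        ret = num_pattern.match(text)
        # group 1 is None when the line has no numbering tag
        if ret is not None and ret.group(1) is not None:
            numbered += 1
    return numbered / len(doc) >= cutoff
```

With something like this, `if all(num is None ...)` could become `if not uses_line_numbers(doc)`, so a document with only a stray match or two (say "vol.3" in running text) still gets the auto-numbering branch.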
@iamlemec One issue off the bat is that the walrus operator requires Python 3.8, I believe, but Jupyter comes with 3.7. Is it better to try to get Python 3.8 running on Jupyter (still beta, right?), or to rewrite it without the walrus operator?
Ah, right. Actually 3.8 is released! Python 3.9 is beta right now (natch I'm running that). You should be able to upgrade to 3.8. I feel like regex is actually the place where walrus is the most useful because of the return structure (a match object or None). I'd suggest upgrading. You'll also get "self-documenting" f-string printing, like f'{expr=}'.
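If upgrading isn't an option, the two walrus comprehensions can be rewritten for Python 3.7 without much pain. This is only a sketch of one possible rewrite: the helper names `paragraph_texts` and `match_groups` are invented here, and `paras` is assumed to be anything whose items have a `get_text()` method (e.g. the result of `find_all`).

```python
import re

num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')

def paragraph_texts(paras):
    """Equivalent of [t for p in paras if len(t := p.get_text()) > 0]."""
    # compute each text once in a generator, then filter the results
    texts = (p.get_text() for p in paras)
    return [t for t in texts if len(t) > 0]

def match_groups(doc):
    """Equivalent of [ret.groups() for text in doc if (ret := num_pattern.match(text)) is not None]."""
    groups = []
    for text in doc:
        ret = num_pattern.match(text)
        if ret is not None:
            groups.append(ret.groups())
    return groups
```

The rest of the line parser doesn't use `:=`, so swapping in these two helpers should be the only change needed.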
The main lingering issue is that the current line extraction code can't handle texts that lack manual line enumeration: it loses all of the text.
https://github.com/pickettj/pahlavi_digital_projects/blob/master/Corpus_Builder_Pahlavi.ipynb
@iamlemec I got the flat hierarchy and csv export working: thanks again!
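For anyone landing here later, a flat layout plus CSV export could look roughly like the sketch below. This is not the notebook's actual code: it assumes the nested `{document: {line_id: text}}` shape of `pahlavi_corpus_lines` from above, and the function names and the column names `document`, `line`, `text` are placeholders.

```python
import csv

def flatten_corpus(corpus_lines):
    """Turn {doc_name: {line_id: text}} into a flat list of row dicts."""
    return [
        {"document": name, "line": line_id, "text": text}
        for name, segment in corpus_lines.items()
        for line_id, text in segment.items()
    ]

def write_csv(rows, handle):
    """Write the flat rows to an open text handle, with a header row."""
    writer = csv.DictWriter(handle, fieldnames=["document", "line", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Usage (the file name is hypothetical):
# with open("pahlavi_corpus.csv", "w", newline="", encoding="utf-8") as f:
#     write_csv(flatten_corpus(pahlavi_corpus_lines), f)
```

Opening the file with `newline=""` lets the `csv` module handle line endings itself, which matters here because segment texts can contain embedded newlines.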