pickettj / pahlavi_digital_projects

Working on a number of projects related to Pahlavi (Middle Persian) texts.

Line Extraction #3

Open pickettj opened 3 years ago

pickettj commented 3 years ago

The main lingering issue is that the current line extraction code can't handle texts that lack manual line enumeration: for those texts it loses all of the text.

https://github.com/pickettj/pahlavi_digital_projects/blob/master/Corpus_Builder_Pahlavi.ipynb

@iamlemec I got the flat hierarchy and csv export working: thanks again!

iamlemec commented 3 years ago

Here's what I've got. For the loading and parsing, I might just go with a list of lines initially? Here's my implementation, which makes use of the walrus operator (:=):

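from bs4 import BeautifulSoup

pahlavi_corpus = {}
for name, src in pah_xml_corpus.items():
    tree = BeautifulSoup(src)  # no parser named, so bs4 picks its default
    paras = tree.find_all("w:p")  # Word paragraph elements
    # keep the text of every paragraph that is not empty
    document = [t for p in paras if len(t := p.get_text()) > 0]
    pahlavi_corpus[name] = document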

If you're feeling feisty, you can even implement it in "one" line:

pahlavi_corpus = {
    name: [
        t for p in BeautifulSoup(src).find_all("w:p") if len(t := p.get_text()) > 0
    ] for name, src in pah_xml_corpus.items()
}

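Things got a little wild with the line parser, but here's my best shot:

import re

# group 1: optional line-number tag ending in 1-3 ".N" parts (N up to 3 digits)
# group 2: the remaining text of the paragraph
num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')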

pahlavi_corpus_lines = {}

for name, doc in pahlavi_corpus.items():
    match = [ret.groups() for text in doc if (ret := num_pattern.match(text)) is not None]

    if all(num is None for num, text in match):
        # this doc doesn't use line numbers, make our own
        pahlavi_corpus_lines[name] = {str(i+1): text for i, (_, text) in enumerate(match)}
        continue

    segment = {}
    para, line = None, None

    for i, (num, text) in enumerate(match+[('-end-', '')]):
        if num is not None:
            if i > 0:
                # text before the first numbered tag is filed under '-start-'
                store = '-start-' if para is None else para
                segment[store] = line
            para, line = num, None

        if line is None:
            line = text
        else:
            line += '\n' + text

    pahlavi_corpus_lines[name] = segment

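This checks whether any paragraph in the document carries a line-number tag; if none does, the lines are simply numbered in order. You could also use some kind of cutoff here, for example treating a document as unnumbered when only a small fraction of its paragraphs carry a tag.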
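A rough sketch of what such a cutoff could look like. The helper names (numbered_fraction, uses_line_numbers) and the 0.1 default are made up for illustration, not something from the notebook; the regex is the same num_pattern as above.

```python
import re

# Same pattern as in the line parser above.
num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')


def numbered_fraction(doc):
    """Return the share of paragraphs in doc that carry a line-number tag."""
    if not doc:
        return 0.0
    hits = 0
    for text in doc:
        m = num_pattern.match(text)
        # group(1) is None when the optional number tag did not match
        if m is not None and m.group(1) is not None:
            hits += 1
    return hits / len(doc)


def uses_line_numbers(doc, cutoff=0.1):
    """Hypothetical rule: numbered only if at least `cutoff` of paragraphs are tagged."""
    return numbered_fraction(doc) >= cutoff
```

In the parser, the all(num is None ...) test could then be swapped for not uses_line_numbers(doc), so a document with only a stray tag or two still gets sequential numbers. The right cutoff would have to be tuned against the actual corpus.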

pickettj commented 3 years ago

@iamlemec One issue off the bat is that the walrus operator requires Python 3.8, I believe, but Jupyter comes with 3.7. Is it better to try to get Python 3.8 running on Jupyter (still beta, right?), or to rewrite the code without the walrus operator?
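For reference, a walrus-free rewrite of the two comprehensions might look roughly like this sketch, which should run on 3.7. The function names (nonempty_texts, match_groups) are placeholders of mine, and the only assumption about the paragraph objects is that they have a get_text() method, as BeautifulSoup tags do.

```python
import re


def nonempty_texts(paragraphs):
    """Collect the text of each paragraph, skipping empty ones, without :=."""
    texts = []
    for p in paragraphs:
        t = p.get_text()
        if len(t) > 0:
            texts.append(t)
    return texts


def match_groups(doc, pattern):
    """Return pattern.match(text).groups() for every text that matches."""
    results = []
    for text in doc:
        ret = pattern.match(text)
        if ret is not None:
            results.append(ret.groups())
    return results
```

The trade-off is just verbosity: each comprehension becomes an explicit loop with a temporary variable.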

iamlemec commented 3 years ago

Ah, right. Actually 3.8 is released! Python 3.9 is in beta right now (natch I'm running that). You should be able to upgrade to 3.8. I feel like regex is actually the place where walrus is the most useful, because match() returns either a match object or None, so you can bind the result and test it in one expression. I'd suggest upgrading. You'll also get "self-documenting" f-string printing, like f'{expr=}'.
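A small sketch of the two 3.8 features together, reusing num_pattern from above. The describe function is just an illustrative name, not part of the notebook.

```python
import re

num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')


def describe(text):
    # Walrus: bind the match object and test it in the same expression.
    if (ret := num_pattern.match(text)) is not None:
        num, rest = ret.groups()
        # The trailing "=" emits the expression text followed by its repr.
        return f'{num=} {rest=}'
    return 'no match'


print(describe('DD.1 x'))  # num='DD.1' rest=' x'
```

Both need Python 3.8 or later; on 3.7 each is a syntax error.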