numerique-gouv / impress

MIT License

⚗️ Extract text from base64 yjs document #270

Open AntoLC opened 1 month ago

Purpose

To index our documents, we need a way to extract their text. This PR adds a function that extracts the text from a base64-encoded Yjs document.

In detail:

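    import base64

    import y_py as Y  # assumption: Y refers to the y-py bindings
    from bs4 import BeautifulSoup

    # I wrote "Hello world" in the blocknote editor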
    # This is the base64 string of the Yjs document saved in Minio
    base64_string = (
        "ARCymr/3DgAHAQ5kb2N1bWVudC1zdG9yZQMKYmxvY2tHcm91cAcAspq/9w4AAw5ibG9j"
        "a0NvbnRhaW5lcgcAspq/9w4BAwlwYXJhZ3JhcGgHALKav/cOAgYEALKav/cOAwFIKACy"
        "mr/3DgINdGV4dEFsaWdubWVudAF3BGxlZnQoALKav/cOAQJpZAF3DmluaXRpYWxCbG9j"
        "a0lkKACymr/3DgEJdGV4dENvbG9yAXcHZGVmYXVsdCgAspq/9w4BD2JhY2tncm91bmRD"
        "b2xvcgF3B2RlZmF1bHSHspq/9w4BAw5ibG9ja0NvbnRhaW5lcgcAspq/9w4JAwlwYXJh"
        "Z3JhcGgoALKav/cOCg10ZXh0QWxpZ25tZW50AXcEbGVmdCgAspq/9w4JAmlkAXckMTFj"
        "YTgzYmEtZGM3OS00N2Q3LTllNzYtNmM4OTQwNzc1ZjE3KACymr/3DgkJdGV4dENvbG9y"
        "AXcHZGVmYXVsdCgAspq/9w4JD2JhY2tncm91bmRDb2xvcgF3B2RlZmF1bHSEspq/9w4E"
        "C2VsbG8gd29ybGQgAA=="
    )
    decoded_bytes = base64.b64decode(base64_string)
    uint8_array = bytearray(decoded_bytes)

    d1 = Y.YDoc()
    Y.apply_update(d1, uint8_array)
    blocknote = str(d1.get_xml_element("document-store"))

    # blocknote var will look like this:
    # <UNDEFINED>
    # <blockGroup>
    #     <blockContainer "backgroundColor"="default" "id"="initialBlockId" "textColor"="default">
    #         <paragraph "textAlignment"="left">Hello world </paragraph>
    #     </blockContainer>
    #     <blockContainer "id"="11ca83ba-dc79-47d7-9e76-6c8940775f17" "backgroundColor"="default" "textColor"="default">
    #         <paragraph "textAlignment"="left"></paragraph>
    #     </blockContainer>
    # </blockGroup>
    # </UNDEFINED>

    # BeautifulSoup is used to extract the text from the previous structure
    soup = BeautifulSoup(blocknote, "html.parser")
    extracted_text = soup.get_text(separator=" ").strip()

    assert extracted_text == "Hello world"
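
For comparison, the final text-extraction step could also be sketched with only the standard library. This is a minimal sketch, not the method used in this PR: it swaps BeautifulSoup for Python's built-in `html.parser.HTMLParser`, and the names `TextCollector` and `extract_text` are hypothetical. It assumes the serialized markup has the shape shown in the comment above. Unlike `get_text(separator=" ").strip()`, it also collapses runs of internal whitespace.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def extract_text(markup: str) -> str:
    """Return the text content of the markup, whitespace-normalized."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    # Join text nodes with a space, then collapse runs of whitespace.
    return " ".join(" ".join(collector.parts).split())


# Markup shaped like the serialized "document-store" element shown above.
sample = (
    '<UNDEFINED><blockGroup>'
    '<blockContainer "backgroundColor"="default" "id"="initialBlockId" "textColor"="default">'
    '<paragraph "textAlignment"="left">Hello world </paragraph>'
    '</blockContainer>'
    '</blockGroup></UNDEFINED>'
)
print(extract_text(sample))
```

Dropping the BeautifulSoup dependency is the only motivation here; BeautifulSoup remains the more forgiving choice if the serialized markup changes shape.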