Open AntoLC opened 1 month ago
To index our documents, we need a way to extract their text. We created a function that extracts the text from a base64-encoded Yjs document.
```python
import base64

import y_py as Y  # assumption: `Y` is the y_py (Ypy) binding, inferred from the calls below
from bs4 import BeautifulSoup

# I wrote "Hello world" in the BlockNote editor.
# This is the base64 string of the Yjs document saved in Minio.
base64_string = (
    "ARCymr/3DgAHAQ5kb2N1bWVudC1zdG9yZQMKYmxvY2tHcm91cAcAspq/9w4AAw5ibG9j"
    "a0NvbnRhaW5lcgcAspq/9w4BAwlwYXJhZ3JhcGgHALKav/cOAgYEALKav/cOAwFIKACy"
    "mr/3DgINdGV4dEFsaWdubWVudAF3BGxlZnQoALKav/cOAQJpZAF3DmluaXRpYWxCbG9j"
    "a0lkKACymr/3DgEJdGV4dENvbG9yAXcHZGVmYXVsdCgAspq/9w4BD2JhY2tncm91bmRD"
    "b2xvcgF3B2RlZmF1bHSHspq/9w4BAw5ibG9ja0NvbnRhaW5lcgcAspq/9w4JAwlwYXJh"
    "Z3JhcGgoALKav/cOCg10ZXh0QWxpZ25tZW50AXcEbGVmdCgAspq/9w4JAmlkAXckMTFj"
    "YTgzYmEtZGM3OS00N2Q3LTllNzYtNmM4OTQwNzc1ZjE3KACymr/3DgkJdGV4dENvbG9y"
    "AXcHZGVmYXVsdCgAspq/9w4JD2JhY2tncm91bmRDb2xvcgF3B2RlZmF1bHSEspq/9w4E"
    "C2VsbG8gd29ybGQgAA=="
)

# Decode the base64 payload into the binary Yjs update.
decoded_bytes = base64.b64decode(base64_string)
uint8_array = bytearray(decoded_bytes)

# Apply the update to an empty document and serialize its XML fragment.
d1 = Y.YDoc()
Y.apply_update(d1, uint8_array)
blocknote = str(d1.get_xml_element("document-store"))

# The blocknote variable will look like this:
# <UNDEFINED>
#   <blockGroup>
#     <blockContainer "backgroundColor"="default" "id"="initialBlockId" "textColor"="default">
#       <paragraph "textAlignment"="left">Hello world </paragraph>
#     </blockContainer>
#     <blockContainer "id"="11ca83ba-dc79-47d7-9e76-6c8940775f17" "backgroundColor"="default" "textColor"="default">
#       <paragraph "textAlignment"="left"></paragraph>
#     </blockContainer>
#   </blockGroup>
# </UNDEFINED>

# BeautifulSoup is used to extract the text from the previous structure.
soup = BeautifulSoup(blocknote, "html.parser")
soupValue = soup.get_text(separator=" ").strip()
assert soupValue == "Hello world"
```
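The last step of the snippet above (markup string to plain text) can also be sketched with the standard library alone, which may help when testing the extraction without a Yjs document at hand. This is only an illustration, not the function added by this PR: `extract_text` and `_TextCollector` are hypothetical names, and the sample markup is simplified (no attributes), so behavior on the quoted attribute names shown above is not verified here.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of a markup string, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only text nodes that contain something other than whitespace.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(markup: str, separator: str = " ") -> str:
    """Hypothetical helper: return the text content of `markup`, joined by `separator`."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return separator.join(collector.chunks)


# Simplified stand-in for the serialized BlockNote structure.
sample = (
    "<blockGroup>"
    "<blockContainer><paragraph>Hello world </paragraph></blockContainer>"
    "<blockContainer><paragraph></paragraph></blockContainer>"
    "</blockGroup>"
)
print(extract_text(sample))
```

Empty paragraphs contribute nothing to the result, so trailing empty blocks do not leave stray separators in the indexed text.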
Purpose
To index our documents, we need a way to extract their text. We created a function that extracts the text from a base64-encoded Yjs document.
In detail: