scanny / python-pptx

Create Open XML PowerPoint documents in Python
MIT License

Support for parsing Equations #892

Open AM-ash-OR-AM-I opened 1 year ago

AM-ash-OR-AM-I commented 1 year ago

Issue #528 showed how to insert Office Math ML (equation) text, but is there any way to parse or extract that text? #706 seemed to handle it, but it still doesn't work for all text. For example, in the extract from slide.xml below, it parses "We factorise it as" under the <a:r> tag but does not parse "𝑥" under the <a14:m> tag.

<mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
  xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main">
  <mc:Choice Requires="a14">
    <p:sp>
      <p:nvSpPr>
        <p:cNvPr id="8" name="TextBox 7" />
        <p:cNvSpPr txBox="1" />
        <p:nvPr />
      </p:nvSpPr>
      <p:spPr>
        <a:xfrm>
          <a:off x="1422400" y="4460458" />
          <a:ext cx="4528458" cy="682046" />
        </a:xfrm>
        <a:prstGeom prst="rect">
          <a:avLst />
        </a:prstGeom>
      </p:spPr>
      <p:txBody>
        <a:bodyPr wrap="square" lIns="0" tIns="0" rIns="0" bIns="0" rtlCol="0" anchor="t">
          <a:spAutoFit />
        </a:bodyPr>
        <a:lstStyle />
        <a:p>
          <a:pPr>
            <a:lnSpc>
              <a:spcPts val="5725" />
            </a:lnSpc>
          </a:pPr>
          <a:r>
            <a:rPr lang="en-IN" sz="4000">
              <a:solidFill>
                <a:schemeClr val="bg1" />
              </a:solidFill>
            </a:rPr>
            <a:t>We factorise it as </a:t>
          </a:r>
          <a14:m>
            <m:oMath xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">
              <m:r>
                <a:rPr lang="en-US" sz="4000" i="1" spc="-229">
                  <a:solidFill>
                    <a:srgbClr val="FFC000" />
                  </a:solidFill>
                  <a:latin typeface="Cambria Math" />
                  <a:ea typeface="Cambria Math" panose="02040503050406030204" pitchFamily="18"
                    charset="0" />
                </a:rPr>
                <m:t>𝑥</m:t>  <!-- Doesn't parse this -->
              </m:r>
            </m:oMath>
          </a14:m>
          <a:r>
            <a:rPr lang="en-IN" sz="4000">
              <a:solidFill>
                <a:schemeClr val="bg1" />
              </a:solidFill>
            </a:rPr>
            <a:t> =</a:t>
          </a:r>
          <a:endParaRPr lang="en-US" sz="4000" spc="-229" dirty="0">
            <a:solidFill>
              <a:schemeClr val="bg1" />
            </a:solidFill>
            <a:latin typeface="+mj-lt" />
          </a:endParaRPr>
        </a:p>
      </p:txBody>
    </p:sp>
  </mc:Choice>
  <mc:Fallback xmlns="">
    <p:sp>
      <p:nvSpPr>
        <p:cNvPr id="8" name="TextBox 7" />
        <p:cNvSpPr txBox="1">
          <a:spLocks noRot="1" noChangeAspect="1" noMove="1" noResize="1" noEditPoints="1"
            noAdjustHandles="1" noChangeArrowheads="1" noChangeShapeType="1" noTextEdit="1" />
        </p:cNvSpPr>
        <p:nvPr />
      </p:nvSpPr>
      <p:spPr>
        <a:xfrm>
          <a:off x="1422400" y="4460458" />
          <a:ext cx="4528458" cy="682046" />
        </a:xfrm>
        <a:prstGeom prst="rect">
          <a:avLst />
        </a:prstGeom>
        <a:blipFill>
          <a:blip r:embed="rId4" />
          <a:stretch>
            <a:fillRect l="-6729" t="-13393" r="-1211" b="-43750" />
          </a:stretch>
        </a:blipFill>
      </p:spPr>
      <p:txBody>
        <a:bodyPr />
        <a:lstStyle />
        <a:p>
          <a:r>
            <a:rPr lang="en-US">
              <a:noFill />
            </a:rPr>
            <a:t> </a:t>
          </a:r>
        </a:p>
      </p:txBody>
    </p:sp>
  </mc:Fallback>
</mc:AlternateContent>

Is there any way to extract this text by parsing the XML tree directly?
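One possible way to do this, sketched here with the standard library only rather than python-pptx's own API, is to walk the raw slide XML and collect both `<a:t>` and `<m:t>` text in document order while skipping the `<mc:Fallback>` branch. The helper names (`iter_paragraphs`, `paragraph_text`, `extract_texts`) and the trimmed `SAMPLE` document are hypothetical and exist only for illustration; the namespace URIs are the standard DrawingML, Office Math, and markup-compatibility ones.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used in slide XML.
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Both regular runs (<a:t>) and math runs (<m:t>) carry visible text.
TEXT_TAGS = {f"{{{A}}}t", f"{{{M}}}t"}


def iter_paragraphs(elem):
    """Yield every <a:p> below elem, skipping <mc:Fallback> subtrees."""
    for child in elem:
        if child.tag == f"{{{MC}}}Fallback":
            continue  # the fallback only holds a picture placeholder
        if child.tag == f"{{{A}}}p":
            yield child
        else:
            yield from iter_paragraphs(child)


def paragraph_text(p):
    """Join the text of all <a:t> and <m:t> descendants in document order."""
    return "".join(node.text or "" for node in p.iter() if node.tag in TEXT_TAGS)


def extract_texts(slide_xml):
    """Return one string per paragraph found in the slide XML (str or bytes)."""
    root = ET.fromstring(slide_xml)
    return [paragraph_text(p) for p in iter_paragraphs(root)]


# Hypothetical, trimmed-down version of the extract above.
# &#x1D465; is the mathematical italic x from the original slide.
SAMPLE = """\
<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
       xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
       xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
       xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
       xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main">
  <mc:AlternateContent>
    <mc:Choice Requires="a14">
      <p:sp><p:txBody><a:p>
        <a:r><a:t>We factorise it as </a:t></a:r>
        <a14:m><m:oMath><m:r><m:t>&#x1D465;</m:t></m:r></m:oMath></a14:m>
        <a:r><a:t> =</a:t></a:r>
      </a:p></p:txBody></p:sp>
    </mc:Choice>
    <mc:Fallback>
      <p:sp><p:txBody><a:p><a:r><a:t> </a:t></a:r></a:p></p:txBody></p:sp>
    </mc:Fallback>
  </mc:AlternateContent>
</p:sld>
"""

if __name__ == "__main__":
    for text in extract_texts(SAMPLE):
        print(text)
```

For a real file, the slide XML could be read straight from the .pptx archive, for example with `zipfile.ZipFile(path).read("ppt/slides/slide1.xml")`. That part name is the usual one but is an assumption here. Note that this flattens an equation to its plain characters; structures such as fractions or superscripts are not reconstructed.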

bennettbrowniowa commented 4 months ago

Also see issue #947. I'm interested in this project as a way to extract the text of professors' slides for a platform that crowdsources contributions, revisions, and reviews of teaching materials in quantum information.