obsidianmd / obsidian-clipper

Highlight and capture the web in your favorite browser. The official Web Clipper extension for Obsidian.
https://obsidian.md/clipper
MIT License
1.01k stars 45 forks

BUG: Cannot web-clip contents with math expressions #121

Open ziyuang opened 2 weeks ago

ziyuang commented 2 weeks ago

Version (please complete the following information):

Describe the bug: Obsidian-clipper does not extract LaTeX expressions from a webpage correctly.

To reproduce

  1. Go to L-space of a statistical experiment
  2. Web-clip the page

Expected behavior: The LaTeX expressions are saved and the math content is rendered correctly. For the block between b) and c) on the page, this would be

$$
L^{′} \left( \mathcal{E} \right) = \left\{ \mu \in ca \left( \Omega , \mathcal{F} \right) : \left| \mu \right| \leq \sum_{i = 1}^{n} \alpha_{i} P_{i} \right.
$$

or Screenshot 2024-11-03 021143

Actual behavior: The expression below is saved:

$$$
L^{′} \left(\right. \mathcal{E} \left.\right) = \left{\right. \mu \in ca \left(\right. \Omega , \mathcal{F} \left.\right) : \left|\right. \mu \left|\right. \leq \sum_{i = 1}^{n} \left(\alpha\right)_{i} P_{i}
$$$

or

image

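The main problem is that \left{ should have been \left\{. There are also minor issues: every delimiter is emitted as an empty pair such as \left(\right. or \left.\right), and \alpha_{i} is written as \left(\alpha\right)_{i}.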
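
As a stopgap, the saved expression can be repaired after conversion. The following is a minimal sketch, not the clipper's or mathml-to-latex's actual fix: the helper names escapeBareBraces, collapseEmptyDelimiterPairs, and cleanLatex are hypothetical, and the set of delimiters handled is an assumption based only on the output shown above.

```javascript
// Hypothetical post-processing helpers; not part of obsidian-clipper or mathml-to-latex.

// A delimiter that may follow \left or \right:
// an escaped brace or bar (\{ \} \|), or a bare bracket or bar.
const DELIM = String.raw`(\\[{}|]|[()\[\]|])`;

// \left{ and \right} are invalid because a bare brace opens or closes a TeX group;
// the delimiter has to be written as \{ or \}.
function escapeBareBraces(latex) {
  return latex.replace(
    /\\(left|right)\s*([{}])/g,
    (_, side, brace) => `\\${side}\\${brace}`
  );
}

// \left(\right. and \left.\right) wrap nothing, so they can be reduced
// to the plain delimiter without changing what is displayed.
function collapseEmptyDelimiterPairs(latex) {
  const openThenNull = new RegExp(String.raw`\\left\s*${DELIM}\s*\\right\s*\.`, 'g');
  const nullThenClose = new RegExp(String.raw`\\left\s*\.\s*\\right\s*${DELIM}`, 'g');
  return latex.replace(openThenNull, '$1').replace(nullThenClose, '$1');
}

// Escape first, so that \left{\right. becomes \left\{\right. and then \{.
function cleanLatex(latex) {
  return collapseEmptyDelimiterPairs(escapeBareBraces(latex));
}
```

Applied to the saved expression, this turns \left{\right. into \{ and \left(\right. ... \left.\right) into ( ... ). It does not touch \left(\alpha\right)_{i}, which is valid LaTeX even if it is not what the page intended.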

Your template file: default-clipper.json, obsidian-web-clipper-settings.json

ziyuang commented 2 weeks ago

It looks like obsidian-clipper uses @mozilla/readability, but I don't see huge problems in Firefox's Reader View (which uses the same library):

Screenshot 2024-11-03 023750

kepano commented 1 week ago

The problem is in the conversion to Markdown, not in Readability.

ziyuang commented 1 week ago

So it looks for the <math> node and converts it to a LaTeX expression with mathml-to-latex:

const mathElement = assistiveMml.querySelector('math');
if (!mathElement) {
    return content;
}

let latex;
try {
    latex = MathMLToLaTeX.convert(mathElement.outerHTML);
} catch (error) {
    console.error('Error converting MathML to LaTeX:', error);
    return content;
}

For example, the <math> node for the first equation on the page looks like this:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mrow class="MJX-TeXAtom-ORD">
    <mfrac>
      <mrow> <mi>d</mi> <mi>P</mi> </mrow> <mrow> <mi>d</mi> <mi>μ</mi> </mrow>
    </mfrac>
  </mrow>
  <mo>∈</mo>
  <msub>
    <mi>L</mi> <mrow class="MJX-TeXAtom-ORD"> <mn>1</mn> </mrow>
  </msub>
  <mo stretchy="false">(</mo> <mi>μ</mi> <mo stretchy="false">)</mo>
</math>

I would use this only as a plan B, because the corresponding LaTeX source sometimes appears in a nearby <script> node. In this case, it is:

<script type="math/tex; mode=display" id="MathJax-Element-6">
  { \frac{dP }{d \mu } } \in L _ {1} ( \mu )
</script>
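
Reading the LaTeX source directly from such a node could look roughly like the sketch below. The function name texScriptToMarkdown is hypothetical, and the code assumes it runs before anything removes <script> elements from the DOM, which this thread has not confirmed. It only relies on getAttribute and textContent, so any DOM element works.

```javascript
// Hypothetical helper; not taken from obsidian-clipper.
// Converts a MathJax 2 style <script type="math/tex"> element into Markdown math.
// Returns null when the element is not a TeX script, so the caller can fall
// back to the MathML conversion (the "plan B").
function texScriptToMarkdown(scriptEl) {
  const type = (scriptEl.getAttribute('type') || '').toLowerCase();
  if (!type.startsWith('math/tex')) {
    return null;
  }

  const latex = (scriptEl.textContent || '').trim();
  if (!latex) {
    return null;
  }

  // "math/tex; mode=display" marks a block equation; plain "math/tex" is inline.
  const isDisplay = /mode\s*=\s*display/.test(type);
  return isDisplay ? '$$\n' + latex + '\n$$' : '$' + latex + '$';
}
```

In a content script, the candidates could be collected with document.querySelectorAll('script[type^="math/tex"]') and each one passed to this function.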

ziyuang commented 1 week ago

Oh, does Readability strip off the script block already?

Cortys commented 4 days ago

On this page (using MathJax 3), there is a related, but slightly different, issue. Here, the math expressions are ignored entirely.

ziyuang commented 4 days ago

> On this page (using MathJax 3), there is a related, but slightly different, issue. Here, the math expressions are ignored entirely.

Also, the figures are broken. For example, the image <img src="DDPM.png" style="width: 100%;" class="center"> is converted to ![](https://lilianweng.github.io/DDPM.png), but it should be ![](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/DDPM.png).
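
The difference between the two URLs is what you get when a relative src is resolved against the site origin instead of the page URL. A minimal sketch of resolving against the page URL with the standard URL constructor follows; resolveImageSrc is a hypothetical name, and the page URL in the usage note is inferred from the expected image URL rather than stated in this thread.

```javascript
// Hypothetical helper; not taken from obsidian-clipper.
// Resolves an <img> src against the URL of the page it came from.
// In a browser, document.baseURI is a natural value for pageUrl.
function resolveImageSrc(src, pageUrl) {
  try {
    // The URL constructor applies standard relative-reference resolution:
    // "DDPM.png" replaces the last path segment of pageUrl.
    return new URL(src, pageUrl).href;
  } catch (error) {
    // Leave the value untouched if either part is not a valid URL.
    return src;
  }
}
```

For example, resolveImageSrc('DDPM.png', 'https://lilianweng.github.io/posts/2021-07-11-diffusion-models/') yields the expected URL, while passing 'https://lilianweng.github.io/' as the base reproduces the broken one. The trailing slash on the page URL matters: without it, the last path segment is treated as a file name and dropped.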

ziyuang commented 4 days ago

> On this page (using MathJax 3), there is a related, but slightly different, issue. Here, the math expressions are ignored entirely.

I also tried mathml-to-latex's playground (it is better to change the version from v1.3.0 to v1.4.2). The <math> blocks on the page convert to LaTeX there.

Maybe something upstream (Readability?) is off.