mfarragher / obsidiantools

Obsidian tools - a Python package for analysing an Obsidian.md vault

Text goes missing even though the HTML is OK (html2text parsing issues) #21

Closed mfarragher closed 1 year ago

mfarragher commented 1 year ago

For one of my notes with a mix of tables, LaTeX, lists & code blocks, a lot of the note's text is missing from source_text_index even though it is present in the HTML. This suggests a parsing issue in how html2text is configured.

Whole paragraph blocks & headers can be completely missing.

This starts to happen after a table with LaTeX. Anything in body text (<p>) afterwards is missing, yet it keeps all the remaining LaTeX (even the stuff in tables).

Perhaps it doesn't like MathJax? Maybe wiping out a few tags from HTML, for the source_text functionality, before it gets processed by html2text could make the output smoother in this case.
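To make the "wipe out a few tags first" idea concrete, here is a minimal sketch using only the standard library. It is not the fix that was eventually committed; the names TagStripper and strip_tags are hypothetical, and which tags to drop (script is used below purely as an example) is an assumption, since that hasn't been pinned down yet.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Rebuild HTML while dropping chosen elements and everything inside them.

    Hypothetical helper for illustration; comments and doctype declarations
    are not re-emitted. Assumes the dropped tags are non-void elements
    (i.e. they have a closing tag or are written self-closed).
    """

    def __init__(self, drop_tags):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.drop_tags = set(drop_tags)
        self.depth = 0   # > 0 while inside a dropped element
        self.parts = []  # pieces of the HTML that are kept

    def handle_starttag(self, tag, attrs):
        if tag in self.drop_tags:
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop_tags:
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> never open a nested scope.
        if tag not in self.drop_tags and self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_tags(html, drop_tags):
    """Return html with every element named in drop_tags removed."""
    parser = TagStripper(drop_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Example input; the script tag stands in for whatever markup trips the parser.
sample = '<p>Before</p><script type="math/tex">x^2</script><p>After</p>'
cleaned = strip_tags(sample, ["script"])
```

The cleaned string would then be handed to html2text in place of the raw HTML, only for the source_text path, so the HTML output itself stays untouched.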

Need to think more about:

mfarragher commented 1 year ago

Test case: https://github.com/mfarragher/obsidiantools/commit/d26299ee4cba8c75c5dcd489fd46bf3f44456834

Text gets cut off after the LaTeX equation (screenshot: source-text-bug).

It's one of a few where I use \tag and \label - the rendering in Obsidian breaks sometimes.

mfarragher commented 1 year ago

Test implemented: https://github.com/mfarragher/obsidiantools/commit/ae894a18c595ee3dcbb7a8a36de512183f9a8925

This is fixed now.