omnivore-app / logseq-omnivore

Logseq plugin to fetch articles and highlights from Omnivore
MIT License

FeatureRequest/HelpNeeded: highlight is not an exact subset of the text content #179

Open thiswillbeyourgithub opened 4 months ago

thiswillbeyourgithub commented 4 months ago

Hi,

I'm the dev behind LogseqMarkdownParser and am working on a small script that turns highlights directly into Anki flashcards.

It's not yet working because I'm running into an issue with text formats.

You see, I don't just want to send the highlight to Anki. I want to grab the 1000-ish characters before and after the highlight, make a cloze card with the highlight (i.e. put a hole in the text whose content you have to guess), then send that to Anki.
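To make the goal concrete, here is a minimal sketch of the cloze step, assuming the highlight's character offsets in the article text are already known (which is exactly what fails in this issue). The function name `make_cloze` and its parameters are hypothetical; the `{{c1::...}}` markup is Anki's standard cloze deletion syntax.

```python
def make_cloze(text: str, start: int, end: int, context: int = 1000) -> str:
    """Wrap text[start:end] in an Anki cloze, keeping `context` chars on each side.

    Hypothetical helper for illustration; `start` and `end` are assumed to be
    valid character offsets of the highlight inside `text`.
    """
    if not 0 <= start < end <= len(text):
        raise ValueError("highlight span is outside the text")
    left = text[max(0, start - context):start]   # up to `context` chars before
    right = text[end:end + context]              # up to `context` chars after
    return left + "{{c1::" + text[start:end] + "}}" + right


if __name__ == "__main__":
    print(make_cloze("abc HIGHLIGHT def", 4, 13, context=2))
```

The hard part is obtaining `start` and `end`, since the highlight text and the article text are not rendered the same way.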

The main issue is that the two don't match. For example, I have this highlight:

For example, suppose ΔW is the weight update for a weight matrix W∈RA×B.

And the relevant section of the article text is this:

For example, suppose \\(\\Delta W\\) is the weight update for a weight ' 'matrix \\(W \\in \\mathbb{R}^{A \\times B}\\).

I'm guessing this is MathJax.

I can't seem to find a good Python library to parse MathJax into text, or text into MathJax, let alone one that does it reliably.

So is it possible to do one of the following:

  1. Either add {{{rawText}}} for the highlight, which would not be parsed (so it would still contain the MathJax)
  2. Or parse the content of the article just like the highlight (currently only the highlight is parsed to text)

Also, the highlight position seems to be broken: all positions are equal to 0 on my end. Is this normal?

Thanks!

thiswillbeyourgithub commented 3 months ago

Hi! Just a quick bump, as I would really like to wrap up my project while I have some free time :) But if you can't find the time to take a look, it's totally fine of course!

jacksonh commented 3 months ago

Hi, I think what you are seeing in the highlight text is raw text, or at least markdown. Can you post a screenshot of the highlight itself?

thiswillbeyourgithub commented 3 months ago

Here's the highlighted section of the text: [image]

The article link is: https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html

thiswillbeyourgithub commented 3 months ago

Hi,

I decided to go the "most robust way" anyway and implement a function that finds the substring of a corpus that best matches the highlight. This is computationally intensive and will probably be an issue for very long texts, but at least I can move on towards finishing this.
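For readers facing the same mismatch, a minimal sketch of this kind of fuzzy search could look like the following. It is not the implementation from my script, just an illustration under stated assumptions: it slides a fixed-width window over the corpus and scores each window with `difflib.SequenceMatcher` from the standard library. The name `best_substring` and the `step` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def best_substring(corpus: str, highlight: str, step: int = 1):
    """Return (start, end, score) of the corpus window most similar to `highlight`.

    Illustrative brute-force search: the window width is fixed to the highlight
    length, so this is only approximate when the two renderings differ in
    length (e.g. raw MathJax source versus rendered text).
    """
    if not corpus or not highlight:
        raise ValueError("corpus and highlight must be non-empty")
    width = min(len(highlight), len(corpus))
    matcher = SequenceMatcher(autojunk=False)
    matcher.set_seq2(highlight)          # seq2 is cached, so set it only once
    best = (0, width, -1.0)
    for start in range(0, len(corpus) - width + 1, step):
        matcher.set_seq1(corpus[start:start + width])
        score = matcher.ratio()          # similarity in [0, 1]
        if score > best[2]:              # strict '>' keeps the earliest best window
            best = (start, start + width, score)
    return best


if __name__ == "__main__":
    text = "The quick brown fox jumps over the lazy dog"
    start, end, score = best_substring(text, "brown fox")
    print(start, end, score, repr(text[start:end]))
```

Every window is compared against the whole highlight, so the cost grows roughly with the corpus length times the matching cost per window. A larger `step` trades accuracy for speed.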

When I finish this project, if I think it's worth it, I'll come back to you to see if it deserves a mention in a blog post or whatever :)

In the meantime, although I still think my request is legit and someone might have a real need for more precise filter access in the API, I'll let you decide if you want to close this or not :)

Have a nice day!