mp68 closed this issue 4 months ago
Hi, thanks for the kind words :)
I just pushed a new branch: https://github.com/thiswillbeyourgithub/LogseqPDFImporter/tree/fix_nonunique_uuids
I added an argument to specify what to do with non-unique UUIDs. Please do the following:
I added a sanity check that verifies that the annotations are indeed full duplicates. Please report back if you see prints indicating that "Annotations with the same UUID are actually different".
Please report back here so that I can merge this with the main branch if this works fine.
Wow, thank you for your incredibly fast reply! I tested the branch and can report the following:
Log output (too large for copy&paste): https://gist.github.com/mp68/4eba668a63d3f9a5b9c95c53cde93592
Thanks for reporting. That showed that I had made a mistake: the UUID was derived from the filename + text, but your annotations all have empty text, so they all ended up with the same UUID.
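To illustrate the collision described above, here is a minimal sketch. It is not the actual LogseqPDFImporter code: the function names, the `uuid5` hashing scheme, and the idea of mixing in the page number and annotation index are all assumptions made for the example, and the real fix in the branch may work differently.

```python
import uuid


def naive_annotation_uuid(filename: str, text: str) -> str:
    """Hypothetical: deterministic UUID built from filename + text only."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{filename}|{text}"))


def positional_annotation_uuid(filename: str, text: str, page: int, index: int) -> str:
    """Hypothetical alternative: also mix in the page and the annotation index."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{filename}|{page}|{index}|{text}"))


filename = "paper.pdf"  # placeholder file name
annotations = [
    {"page": 1, "text": ""},
    {"page": 1, "text": ""},
    {"page": 3, "text": ""},
]

# With empty text, every annotation hashes the same input string.
naive = {naive_annotation_uuid(filename, a["text"]) for a in annotations}

# Adding position information makes each input string distinct.
positional = {
    positional_annotation_uuid(filename, a["text"], a["page"], i)
    for i, a in enumerate(annotations)
}

print(len(naive))       # 1: all three annotations collide
print(len(positional))  # 3: one UUID per annotation
```

Any extra field that differs between annotations would break the tie; page and index are used here only because they are always available, even when no text could be extracted.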
Can you try again please?
I expect that you will indeed have the colored area at the location of the highlight in the PDF, but unfortunately it might also mean that all your annotations will be devoid of text... If that happens, do try playing with the text_boundary_threshold argument and report back!
To help with that, I also added a new print that tells you whether an annotation has empty text.
(Do check that you have the latest commit: d4ac839bd0cc6b10c18cb0d6e14af0d5232d6a29 !)
The annotations are now correctly recognised, that part is working great! However, the annotations themselves cannot be used as a link or reference, as their text content is empty. This also results in a visually broken annotation page. I believe there is no other way to extract the text again? So I would suggest to replace the empty text with the filename + "reference" + increasing number.
Can you provide pictures?
Sure! 😊 Annotations are well integrated into the pdf
The annotations page is broken
I believe there is no other way to extract the text again?
Well, you can delete the annotation page, rerun LogseqPDFImporter, and try modifying the arguments for the overlap.
So I would suggest to replace the empty text with the filename + "reference" + increasing number.
What do you mean by reference?
I pushed a fix, can you try it and provide pictures please?
(Nothing to do with that but given the type of pdf you're reading and being a medical student myself you might be interested in taking a look at my other repos. For example DocToolsLLM and anki related stuff. I have more repos to create in the coming 12 months too that are geared towards education.)
Works great now! Images are still broken, but I think it's a path problem. This is what gets generated in the assets folder:
What do you mean by reference?
Just as you did it but in a more "elegant" way 😊 So the filler text of the annotation will be "Bataller-Outcomes and genetic dynamics of acute myeloid leukemia at first relapse-2020-Haematologica Reference 12" instead of "Notext 12"
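The naming scheme suggested here can be sketched as follows. This is only an illustration under assumptions: `placeholder_text` and `fill_empty_texts` are hypothetical helper names, not functions from LogseqPDFImporter, and the counter here numbers only the empty annotations, which may differ from how the tool numbers them.

```python
from pathlib import Path


def placeholder_text(pdf_path: str, counter: int) -> str:
    """Build filler text such as '<pdf name without extension> Reference 12'."""
    return f"{Path(pdf_path).stem} Reference {counter}"


def fill_empty_texts(pdf_path: str, texts: list[str]) -> list[str]:
    """Replace empty or whitespace-only annotation texts with numbered placeholders."""
    filled = []
    counter = 0
    for text in texts:
        if text.strip():
            filled.append(text)  # keep text that was extracted successfully
        else:
            counter += 1
            filled.append(placeholder_text(pdf_path, counter))
    return filled


if __name__ == "__main__":
    # "paper.pdf" is a placeholder file name.
    print(fill_empty_texts("paper.pdf", ["", "some highlighted text", "  "]))
```

Because the placeholder contains the file name, it stays recognisable when the annotation is referenced from another page.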
Very interesting stuff, thank you for sharing these! Will keep an eye on them. Where are you based at?
Too privacy-conscious to share that, sorry :)
I'll see to the rest another day
I pushed a commit for the new filename, can you check if it's better for you?
Also, can you investigate the path issue for the images a bit? For example, by telling me the full path of the images in the assets folder and the path / id indicated in the .edn file?
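One way to collect the full paths asked for above is a short script like this one. It is a generic sketch, not part of LogseqPDFImporter: `list_asset_paths` is a hypothetical helper, and the `assets` folder name passed to it is an assumption to be replaced with the real location inside the graph.

```python
from pathlib import Path


def list_asset_paths(assets_dir: str) -> list[str]:
    """Return the absolute path of every file under assets_dir, sorted."""
    root = Path(assets_dir).expanduser().resolve()
    return sorted(str(p) for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Replace "assets" with the path to the graph's assets folder.
    for path in list_asset_paths("assets"):
        print(path)
```

The printed paths can then be compared by hand with the path / id stored in the .edn file.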
Up
Without an answer from you, I decided to go ahead and merge the two branches. I'm still bothered by the image paths being wrong, so please don't hesitate to re-open this issue when you have time to share some more information so that I can fix it.
Thank you for your great work on making this possible for Logseq! For research purposes I'm using the Readcube Papers app as a reference manager and for annotations on device. It has the nice possibility to export the annotations embedded in a PDF file. Unfortunately, they appear to use non-unique UUIDs. It would be amazing if there is a workaround for this, as the Papers app + Logseq would make for a really nice research workflow.
Console error: