PDF Scrapper without sorting of extracted keys

j-steinbach commented 3 years ago

As mentioned in https://github.com/org-roam/org-roam-bibtex/pull/44, the PDF Scrapper extracts keys from PDFs and then sorts them into in-roam, in-bib, valid and invalid.

I would like it to not sort the keys, but instead simply keep the structure of the list of extracted references.

I looked through the available variables in Emacs but didn't see anything like orb-pdf-scrapper-sort t, so I guess it always sorts at the moment.

In "text mode" it shows me the extracted list in the form

[1] coolguy2020fun
[2] dudette1729relax
[3] jobs2005iphone

But after "org mode" it sorts and turns the references into

valid
- cite:jobs2005iphone

invalid
- cite:coolguy2020fun

in-bib
- cite:dudette1729relax

I would like it to just keep it as

- cite:coolguy2020fun
- cite:dudette1729relax
- cite:jobs2005iphone
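
The difference between the two outputs can be sketched in a few lines of Emacs Lisp. This is only an illustration of the requested behaviour, not how ORB implements it: `my/format-references` and its `(KEY . CATEGORY)` input format are hypothetical.

```emacs-lisp
;; Hypothetical helper, not part of org-roam-bibtex.
(defun my/format-references (refs &optional group)
  "Format REFS, a list of (KEY . CATEGORY) pairs, as Org list lines.
With GROUP non-nil, collect the keys under their category names.
Otherwise keep the keys in their original order in one flat list."
  (if group
      (let (groups)
        ;; Build an alist of (CATEGORY KEY...) in order of first appearance.
        (dolist (ref refs)
          (let ((cell (assoc (cdr ref) groups)))
            (if cell
                (setcdr cell (append (cdr cell) (list (car ref))))
              (setq groups
                    (append groups (list (list (cdr ref) (car ref))))))))
        (mapconcat
         (lambda (g)
           (concat (car g) "\n"
                   (mapconcat (lambda (k) (format "- cite:%s" k))
                              (cdr g) "\n")))
         groups "\n\n"))
    ;; Flat mode: one list item per reference, order untouched.
    (mapconcat (lambda (ref) (format "- cite:%s" (car ref))) refs "\n")))
```

Calling it without the second argument on the three example keys gives the flat list shown above; passing a non-nil GROUP gives the grouped form.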

Why?

I extract my PDF annotations with org-noter. This results in text in the form

"As mentioned in [1], the dude was cool."

Then I replace all mentions of [1] with the corresponding key. (With a macro or manually. Often I also create an org-roam note.)

"As mentioned in [cite:coolguy2020fun], the dude was cool."

For this I need to know that [1] references coolguy2020fun. If the extracted references get sorted, I have to manually "unsort" them.
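
The replacement step could be sketched roughly like this, assuming the extracted list is still in its original numbered order. Both functions are hypothetical helpers, not part of ORB or org-noter.

```emacs-lisp
;; Hypothetical helpers, not part of org-roam-bibtex or org-noter.
(defun my/parse-numbered-keys (text)
  "Return an alist of (NUMBER-STRING . KEY) parsed from TEXT.
TEXT should contain lines such as \"[1] coolguy2020fun\".
An optional \"cite:\" prefix before the key is ignored."
  (let ((start 0) result)
    (while (string-match
            "^\\[\\([0-9]+\\)\\][ \t]+\\(?:cite:\\)?\\([^ \t\n]+\\)"
            text start)
      (push (cons (match-string 1 text) (match-string 2 text)) result)
      (setq start (match-end 0)))
    (nreverse result)))

(defun my/replace-numbers-with-keys (text keys)
  "Replace each \"[N]\" in TEXT with \"[cite:KEY]\" using the alist KEYS.
Numbers that have no entry in KEYS are left unchanged."
  (replace-regexp-in-string
   "\\[\\([0-9]+\\)\\]"
   (lambda (match)
     (let ((key (cdr (assoc (match-string 1 match) keys))))
       (if key (format "[cite:%s]" key) match)))
   text t t))
```

For example, `(my/replace-numbers-with-keys "As mentioned in [1], the dude was cool." (my/parse-numbered-keys "[1] coolguy2020fun"))` turns the sentence into the [cite:coolguy2020fun] form. This only works if [1] in the list really is reference 1 in the PDF, which is why the order matters.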

Ideally, the PDF Scrapper would return me the references in the form

[1] cite:coolguy2020fun
[2] cite:dudette1729relax
[3] cite:jobs2005iphone

But I don't want to ask for too much :)

Overall, the PDF Scrapper is an awesome and very helpful feature, thank you very much for creating it!

Is it scrapper or scraper?

myshevchuk commented 3 years ago

Hi, thank you for your request. That's definitely not too much and should be relatively easy to achieve. I will look into it over the weekend.

It's certainly scrapper, in the sense that it harvests scraps left from scraping a PDF by AnyStyle and takes them to the Org Roam junkyard.

j-steinbach commented 3 years ago

Awesome! And yes, that sounds exactly like what I am doing - putting stuff in the junkyard and then building a rocket! :rocket:

myshevchuk commented 3 years ago

It seems to be done. Please check pull request #136, or pull the develop branch. If it's working for you, I'll merge it into master.

myshevchuk commented 3 years ago

Also note that, depending on the PDF page layout, the order of the references quite often gets scrambled. This is especially true for two-column PDFs. There's not much ORB can do about it.

j-steinbach commented 3 years ago

You are very fast!

Unfortunately I seem to have come across some small hiccup on my side...

Based on my (small) understanding of Doom Emacs, doing either

(package! org-roam-bibtex
  :recipe
  (:host github
   :repo "org-roam/org-roam-bibtex"
   :branch "feature-134"))

or :branch "develop" and then syncing/building/upgrading Doom should do the job, but now I can't seem to find orb-pdf-scrapper-sort-references, orb-pdf-scrapper-export-fields, or orb-pdf-scrapper-refsection-headings.

It says

orb-pdf-scrapper-sort-references is a variable without a source file.

Do you perhaps have an idea?
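
One way to narrow this down might be to check what the running session actually loaded. The sketch below uses only built-in functions. `my/variable-origin` is a hypothetical helper, and the library name "orb-pdf-scrapper" in the usage comment is an assumption about which file defines the variable.

```emacs-lisp
;; Hypothetical diagnostic helper, built from standard Emacs functions only.
(defun my/variable-origin (symbol library)
  "Return a plist describing where SYMBOL and LIBRARY come from.
:bound is non-nil if SYMBOL currently has a value as a variable.
:defined-in is the file that defined SYMBOL as a variable, if known.
:library-path is the file Emacs would load for LIBRARY, if any."
  (list :bound (boundp symbol)
        :defined-in (symbol-file symbol 'defvar)
        :library-path (locate-library library)))

;; Assumed usage; evaluate with M-: or in the *scratch* buffer:
;; (my/variable-origin 'orb-pdf-scrapper-sort-references "orb-pdf-scrapper")
```

If :library-path points at an old checkout, or :defined-in is nil, the new branch was probably not built or loaded.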

j-steinbach commented 3 years ago

Ok, it appears to work. I reinstalled Doom and brute-forced my way through it. The problem probably came from me using native comp Emacs. But who knows.

Anyways, I tested it with three different PDFs and everything is in a list, in the same order as the "text mode" buffer.

There are two more things I noticed during testing. (They might warrant their own feature requests)

Is there a way to also save/export the "Text mode" and "BibTeX mode" buffers?

The "Text mode" buffer is the one I usually spend the most time in, as I have to sort it, fix it, unscramble it and also compare it to the original PDF. If I make a mistake later on, and I already quit the PDF-Scrapper process, I have to start over. As you said, ORB can't deal with two-column PDFs, so manual fixing is a must.
The "BibTeX mode" buffer is relevant, as I import it into Zotero (which in turn updates my global .bib file, which makes everything accessible in Emacs again). When I am in the process of scrapping references, I really don't want to interrupt my work-flow to copy&paste some BibLaTeX and then fiddle in Zotero.

Would it be possible to also put them into a heading in the origin file? Similar to how it puts the "Org mode" references into a heading "References (retrieved by ORB...)".

Secondly, in the "BibTeX mode" buffer, I have a reference like this

@misc{gartner2013,
  citation-number = {6},
  author = {Gartner},
  date = {2013},
  title = {Gartner’s 2013 hype cycle for emerging technologies}
}

I change it to

@misc{gartner2013gartner,
  citation-number = {6},
  author = {Gartner},
  date = {2013},
  title = {Gartner’s 2013 hype cycle for emerging technologies}
}

and expect it to turn into - cite:gartner2013gartner, but in the generated "Org buffer" it turns into - cite:gartner2013 again.

Sorry for the trouble :)

myshevchuk commented 3 years ago

> The problem probably came from me using native comp Emacs.

Quite likely. I have little experience with native comp though. So should I merge the branch into master?

> There are two more things I noticed during testing. (They might warrant their own feature requests)

Indeed, these should be separate feature requests, which you are welcome to file.

> Is there a way to also save/export the "Text mode" and "BibTeX mode" buffers?

As a temporary solution, navigate to orb--temp-dir and locate your files there. The directory persists until Emacs restart.
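
Since the directory does not survive a restart, copying it somewhere permanent before quitting Emacs might help. A minimal sketch, assuming orb--temp-dir is bound to a directory path in the running session: `my/backup-directory` is a hypothetical helper and the target path is only an example.

```emacs-lisp
;; Hypothetical helper, not part of org-roam-bibtex.
(defun my/backup-directory (source target)
  "Copy the contents of directory SOURCE into TARGET and return TARGET.
Signal an error if SOURCE is not an existing directory."
  (unless (and (stringp source) (file-directory-p source))
    (error "Not a directory: %S" source))
  ;; Keep file times, create parent directories, copy contents only.
  (copy-directory source target t t t)
  target)

;; Assumed usage, to be evaluated before restarting Emacs:
;; (my/backup-directory orb--temp-dir "~/orb-scrapper-backup/")
```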

> Secondly, in the "BibTeX mode" buffer, I have a reference like this...and expect it to turn into - cite:gartner2013gartner

If you have manually edited the BibTeX buffer, don't press y at the prompt that offers to generate keys before proceeding to Org mode. I typically press C-c C-u to automatically generate all the keys, then go through the buffer and manually edit a few of them, perhaps also automatically re-generating several others with C-u C-c C-u. Then, after pressing C-c C-c, I type n at the prompt and proceed to Org mode. The keys are already exactly what I want them to be, so I don't need to generate them again. I know this is counterintuitive, so you are welcome to suggest improvements to this workflow in a feature request.

> Sorry for the trouble :)

Not at all :) ORB PDF Scrapper is still quite raw. It basically works, but the user experience can be improved a lot. Since I know how it works, I'm fine with it, but others will likely run into different kinds of issues. For example, I wanted to implement a transparent save/export mechanism, but since I don't need it too often, I didn't bother. So if you are willing to contribute ideas and participate in testing, I'll be glad to implement them.

j-steinbach commented 3 years ago

From my side you can close/merge.

I tested with a few different PDFs: each of them resulted in a single list filled with - cite:key references. As far as I observed, they kept the order they had in the "Text mode" buffer. Note: I did not test whether the old sort still works.

It seems that I have been pressing y in that buffer.

I will create a few feature requests for you :)

myshevchuk commented 3 years ago

Great!
