PDF Scrapper: Format/List cite-keys in "Org mode"

j-steinbach commented 3 years ago

The resulting cite-keys from the PDF Scrapper process should be stylable. At the moment, they all appear as "cite:some2020key". There should be a user option that changes how they look.

Furthermore, if the "Text mode" buffer is a sorted list, then the resulting "Org mode" cite-keys should also be formattable as a list, for example:

"[$n] cite:$k", where $n is the list index and $k is the key.

Note: I believe that the list indices are already known in the "BibTeX mode" buffer with the property citation-number.

"Text mode"

[1] Mr. Hello 2012 World-tourney
[2] Morgan, F. 1097 Free men in the church
[3] Stanley, A. 2020 How I lost ten pounds in a week

"BibTeX mode"

@misc{hello2012world,
  citation-number = {1},
  author = ...
}
@misc{morgan1097freeman,
  citation-number = {2},
  author = ...
}

"Org mode"

Either "[$n] cite:$k"

[1] cite:hello2012world
[2] cite:morgan1097freeman

or "[$n] $k"

[1] hello2012world
[2] morgan1097freeman

or "- [$n] cite:$k"

- [1] cite:hello2012world
- [2] cite:morgan1097freeman

myshevchuk commented 3 years ago

I suggest adding a variable, orb-pdf-scrapper-org-list-format, which will control the list appearance. It can have the following values.

Symbols:

| value | result |
| --- | --- |
| `ordered-point` | 1. cite:hello2012world |
| `ordered-parenthesis` | 1) cite:hello2012world |
| `unordered-hyphen` | - cite:hello2012world |
| `unordered-plus` | + cite:hello2012world |
| `unordered-asterisk` | * cite:hello2012world |

The above is the standard Org-mode markup for ordered and unordered lists. Such lists can be natively parsed by org-element because they are part of Org syntax. Functions to manipulate such lists can be added later if needed. In the future, I plan to use org-ml to programmatically create Org lists and tables, and this format can be easily handled by org-ml. It is always a good idea to go with standards.

String: A format string. It must contain a valid format specifier, %s, which receives the number. For example, the value "- [%s]" will result in "- [1] cite:hello2012world".
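The two kinds of values proposed above (symbols and a format string) could be handled roughly as follows. This is only a sketch in Python for illustration, not ORB's Emacs Lisp code; the names `LIST_MARKERS` and `format_list_item` are hypothetical, and the symbol names are taken from the table above.

```python
# Hypothetical sketch; not ORB's actual implementation.
# Maps the proposed symbol values to an Org list marker template.
LIST_MARKERS = {
    "ordered-point": "{n}.",
    "ordered-parenthesis": "{n})",
    "unordered-hyphen": "-",
    "unordered-plus": "+",
    "unordered-asterisk": "*",
}


def format_list_item(list_format, number, citation):
    """Return one list line for `citation`.

    `list_format` is either one of the symbol names above or a format
    string containing a single %s, which receives the number.
    """
    if list_format in LIST_MARKERS:
        # Unordered markers contain no {n}, so the number is ignored.
        prefix = LIST_MARKERS[list_format].format(n=number)
    else:
        # Treat any other value as a user-supplied format string.
        prefix = list_format % (number,)
    return f"{prefix} {citation}"


print(format_list_item("ordered-point", 1, "cite:hello2012world"))
print(format_list_item("- [%s]", 1, "cite:hello2012world"))
```

Keeping the symbol values in a lookup table means a custom format string is the only case that needs %-substitution, which matches the split between "standard Org lists" and "free-form" output described above.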

The default source of numbers, where applicable, will be the citation-number field as retrieved by AnyStyle. In my experience though, this field is not always present. Furthermore, in those PDFs I'm working with, references more often than not appear as

a) first reference b) second reference
third reference
a) fourth reference b) fifth reference ...

AnyStyle's success in parsing such citation numbers varies. It can be improved by training the parser model but is never 100% reliable. This is the reason why I did not implement numbering earlier. To cover cases where the citation-number field is nonexistent or is not a number, another variable can be introduced, say orb-pdf-scrapper-org-list-numbering-strategy. I envision the following values (symbols) for it:

| value | result |
| --- | --- |
| `citation-number-raw` | Use whatever is in the field `citation-number`, empty if nothing |
| `citation-number` | Remove non-numerical characters |
| `as-retrieved` | This will result in consecutive numbering as the BibTeX entries appear in the buffer |
| `mixed-raw` | A mixed strategy, where `citation-number` (raw) is given preference, use natural order if empty |
| `mixed` | The same as above but remove any non-numerical characters |

citation-number-raw and mixed-raw will only work when orb-pdf-scrapper-org-list-format is set to a format string. When it is one of the symbols ordered-point or ordered-parenthesis, the result will be the same as with citation-number and mixed, respectively. There will be no effect on unordered lists.
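To make the five strategies concrete, here is a rough sketch of how the number for a single entry might be chosen. It is written in Python for illustration only; `list_number` and its parameters are hypothetical names, and one detail is an assumption on my part: with `mixed`, an entry whose field contains no digits at all falls back to its position in the buffer.

```python
import re


def list_number(strategy, citation_number, position):
    """Pick the list number for one entry (hypothetical sketch).

    citation_number: raw text of the citation-number field, or None.
    position: 1-based position of the entry in the BibTeX buffer.
    """
    raw = (citation_number or "").strip()
    digits = re.sub(r"\D", "", raw)  # drop every non-digit character
    if strategy == "citation-number-raw":
        return raw                    # may be empty
    if strategy == "citation-number":
        return digits                 # may be empty
    if strategy == "as-retrieved":
        return str(position)          # ignore the field entirely
    if strategy == "mixed-raw":
        return raw or str(position)
    if strategy == "mixed":
        # Assumption: fall back to position when no digits survive.
        return digits or str(position)
    raise ValueError(f"unknown strategy: {strategy}")


print(list_number("citation-number", "[12]", 3))
print(list_number("mixed", "a)", 3))
```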

Not related to numbering and fully independent of it, orb-pdf-scrapper-citation-format will control the appearance of the citation part, e.g. cite:%s will result in cite:hello2012world, plain %s will result in hello2012world with any numbering format.
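The independence of the two formats can be sketched like this. Again, this is Python for illustration rather than ORB's Emacs Lisp, and `format_citation` and `format_entry` are hypothetical helpers; the format strings are the ones mentioned in this thread.

```python
def format_citation(citation_format, key):
    """Apply the citation format, e.g. "cite:%s", to a cite-key."""
    return citation_format % (key,)


def format_entry(number_format, citation_format, number, key):
    """Combine a number format and a citation format into one line.

    The two formats never see each other's input, so any number
    format works with any citation format.
    """
    return f"{number_format % (number,)} {format_citation(citation_format, key)}"


print(format_entry("[%s]", "cite:%s", 1, "hello2012world"))
print(format_entry("- [%s]", "%s", 2, "morgan1097freeman"))
```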

Any corrections and further suggestions are welcome. This will require some time and won't be ready until next weekend. While simple formatting such as you suggested can be implemented in no time, I'd prefer to have a more general solution that will suit different workflows.

j-steinbach commented 3 years ago

Sure thing. I can only suggest features, but I have no idea how the internals work. It is very interesting for me how you take my suggestions and change things.

I got another suggestion for the "source of numbers", which is very likely to be difficult to achieve, unreliable and overkill :)

Does AnyStyle identify the citation style? It could be another way to style the list.

Example: AnyStyle identifies the IEEE citation style -> The created list gets styled with citation-number

But it probably is unlikely that a user wants to have their list styled differently for each paper.

How does orb-pdf-scrapper-citation-format differ from orb-citekey-format?

myshevchuk commented 3 years ago

> Does AnyStyle identify the citation style? It could be another way to style the list. Example: AnyStyle identifies the IEEE citation style -> The created list gets styled with citation-number

Yeah, it would be nice, but AnyStyle does not work like that. It does not know any citation styles. It's a machine learning program that learns patterns of author lists, patterns of journal names and so on. So it works more like a human eye, immediately "recognizing" what any particular piece of a citation string is. Don't ask me more :) I'm neither AnyStyle's author nor do I know anything about machine learning.

> How does orb-pdf-scrapper-citation-format differ from orb-citekey-format?

Thanks for pointing this out. I completely forgot about orb-citekey-format. It's used internally in orb-templates, but I'm actually not sure what this variable does exactly and whether it is really needed. I'll figure it out and see if it can be repurposed or dual-purposed.

myshevchuk commented 3 years ago

So, the requested feature seems to be ready. It has been merged into master. If it's working for you, then this request can be closed.

Regarding orb-citekey-format: I don't think it would make sense to use it in PDF Scrapper, for naming consistency reasons, hence orb-pdf-scrapper-citekey-format. orb-citekey-format itself is a useless variable from the user's point of view. It has some internal function, but I plan to remove it altogether. In the current design, ORB asks the user to set this variable to reflect the format of #+ROAM_KEY: they are using (or leave the default value cite:%s) because it will fail otherwise. This is conceptually flawed, because user options should allow the user to control the package's behaviour, not the opposite, where the user has to inform the package about the user's behaviour.

j-steinbach commented 3 years ago

Sounds good. Currently busy, will take a look (and likely close it) in the next few days.

j-steinbach commented 3 years ago

It seems to work. Still busy, so I didn't "test" very thoroughly, but it seems to work. Unfortunately, after switching to "master", Emacs doesn't recognize the ORB variables again, so I don't really know if my config is correct, but it didn't crash, which already makes me happy.

One thing I noticed - is there a way to get an empty line between [12] and In BibTeX file? And why is there no empty line between In Org Roam database and #+name: in-roam? The styles are a bit mixed. (Not inserting empty lines is fine too, but it should be consistent).

** In Org Roam database
#+name: in-roam
[4] cite:lazzaro2008four
[9] cite:john1999big
[12] cite:ekman2007emotions
** In BibTeX file

#+name: in-bib
[7] cite:bess2002bimodal
[8] cite:mccrae1989reinterpreting
[13] cite:biederman2006perceptual
** Valid citation keys

j-steinbach commented 3 years ago

Update: Nope, the variables are fine too. No idea, apparently I just had to wait?

That means it can be closed from my side - but I will gracefully allow you to answer my last question :)

myshevchuk commented 3 years ago

> One thing I noticed - is there a way to get an empty line between [12] and In BibTeX file? And why is there no empty line between In Org Roam database and #+name: in-roam? The styles are a bit mixed. (Not inserting empty lines is fine too, but it should be consistent).

It drove me crazy back when I was writing this part because I couldn't manage to keep these newlines in place. ORB actually inserts them, but they mysteriously disappear, probably slurped by Org mode when it inserts new headlines. This is something that still bothers me, although I'm now used to putting those newlines in by hand. Feel free to open an issue for this; maybe this time I'll have more luck.

myshevchuk commented 3 years ago

> Unfortunately after switching to "master" Emacs doesn't recognize the ORB-variables again

You probably have to re-compile ORB. Also check the docs for correct variable names. They are somewhat different in the final, master version.

myshevchuk commented 3 years ago

Another great way to discover user options is through the Customize interface: M-x customize > search for org-roam-bibtex > navigate to the Orb PDF Scrapper group.

j-steinbach commented 3 years ago

Thanks for the tip! I usually try my luck with either describe-variable or describe-function.

And yes, recompiling is the answer, but also the problem. In Doom there seem to be two ways to do it, doom install and doom sync -u. I usually brute-force my way through both of them, but it often requires calling the commands multiple times until it finally does something... Worst case I delete my .emacs.d folder. Heavy handed, but it works.

I guess we can close this? I will put a request for the newline thing.

myshevchuk commented 3 years ago

> Thanks for the tip! I usually try my luck with either describe-variable or describe-function.

So do I, but especially with larger packages the number of variables can be overwhelming. The best source should be a well-written manual like that of Org mode. ORB's is in progress.

> In Doom there seem to be two ways to do it, doom install and doom sync -u

Not really. Usually doom upgrade should suffice. This, however, will also upgrade the Doom installation, which is usually not a bad thing. One can also call doom upgrade -p to upgrade only packages, not Doom itself. I actually almost never use doom sync because even if I have just made some changes to my config, I'm always tempted to do an upgrade. If you still experience problems after upgrading, try doom build to rebuild all packages or doom build -r to rebuild only the packages that need rebuilding.

> Worst case I delete my .emacs.d folder

A less drastic measure is to delete the straight build directory $DOOM/.local/straight/build-XX.Y/. I rarely have to do so now; Doom has improved significantly in this respect.

org-roam / org-roam-bibtex

PDF Scrapper: Format/List cite-keys in "Org mode" #141