org-roam / org-roam-bibtex

Org Roam integration with bibliography management software
GNU General Public License v3.0

(feat,fix) save/export scrapper buffers #146

Closed myshevchuk closed 3 years ago

myshevchuk commented 3 years ago

This PR introduces the commands orb-pdf-scrapper-save and orb-pdf-scrapper-save-as. The former saves the current progress in a Scrapper buffer to the corresponding temp file. It shadows Emacs' save-buffer in Scrapper buffers, for those (including myself) who press C-x C-s on every occasion. The command orb-pdf-scrapper-save-as (C-x C-w) shadows write-file, so that the buffer can be saved to a new location without breaking the Scrapper session.
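For readers unfamiliar with how one command can "shadow" another in a given kind of buffer, here is a minimal sketch of the general technique using Emacs' command remapping. It is not ORB's actual code: the mode my-scrapper-demo-mode, its keymap and both my-scrapper-demo-* commands are hypothetical names invented for illustration.

```emacs-lisp
;; Hypothetical stand-in for a session-aware save command.
(defun my-scrapper-demo-save ()
  "Pretend to save the current progress to the session's temp file."
  (interactive)
  (message "Progress saved (demo)"))

;; Hypothetical stand-in for a session-aware "save as" command.
(defun my-scrapper-demo-save-as (file)
  "Write a copy of the buffer to FILE without changing the visited file."
  (interactive "FSave copy as: ")
  ;; With START nil, `write-region' writes the whole buffer; since VISIT
  ;; is omitted, the buffer keeps its association with the original file.
  (write-region nil nil file))

(defvar my-scrapper-demo-mode-map
  (let ((map (make-sparse-keymap)))
    ;; Any key bound to `save-buffer' (C-x C-s by default) now runs the demo command.
    (define-key map [remap save-buffer] #'my-scrapper-demo-save)
    ;; Likewise for `write-file' (C-x C-w by default).
    (define-key map [remap write-file] #'my-scrapper-demo-save-as)
    map)
  "Keymap for `my-scrapper-demo-mode' (hypothetical).")

(define-minor-mode my-scrapper-demo-mode
  "Demo minor mode that shadows the standard save commands."
  :lighter " ScrapDemo"
  :keymap my-scrapper-demo-mode-map)
```

Remapping follows whatever keys the user has bound to save-buffer and write-file, which is why it is a common choice over binding C-x C-s directly.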

@j-steinbach this PR has been merged into develop, you are welcome to test it.

j-steinbach commented 3 years ago

It "works". I ran the PDF Scrapper, and saved the "Text mode" and "BibTeX mode" buffers to a different file.

What I am missing is the ability to insert the content into the buffer where I started the PDF Scrapper process.

myshevchuk commented 3 years ago

Yeah, I remember. Just haven't had enough time.

myshevchuk commented 3 years ago

So for now I plan to introduce two user options, orb-pdf-scrapper-export-text-references and orb-pdf-scrapper-export-bibtex-references. Their allowed values will be heading and file. In the first case, references will be saved under a separate heading in the original Org roam buffer. In the second case, they will be saved into a separate file. Perhaps another variable will be needed to control the location of the exported files.

The variable orb-pdf-scrapper-export-bibtex-references will be allowed to have a string value, in which case the BibTeX references will be appended (or prepended) to an existing .bib file, except for those whose keys already match entries in that file.

myshevchuk commented 3 years ago

@j-steinbach So, with the latest commit a feature was added to automatically export the scrap(p)ed data upon hitting C-c C-c in the Org-mode buffer.

The user option orb-pdf-scrapper-export-options (the name may change before merging) controls the automatic export.

This variable is an association list of the form

(TYPE . ((LOCATION TARGET PROPERTIES)))

Example:

(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "org")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (file "temp.org"
               :placement append))
        (txt
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "txt")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (file "directory/"))
        (bib
         (file "my-library.bib"
               :placement prepend))))

The first element of the list controls export of the Org data (TYPE = org). There are two locations declared for the Org export: headline and file. So the Org data will be exported both to the headline named "References (extracted by ORB PDF Scrapper)" in the buffer of origin and to a file named "temp.org".

The headline will be supplied with a property drawer with the following properties, according to the :property-drawer property:

Currently, this property stuff is a prototype. It's not very useful yet, but may become more useful in the future.

Now the file LOCATION. The TARGET of the file location can be an existing directory, in which case a new file will be created, named after the note's citation key plus the corresponding extension (org, bib or txt), and the extracted data will be put there. If the TARGET is not an existing directory, it is assumed to be a file. If the file does not exist yet, it will be created. If the file exists, it will be used. The :placement property controls where to put the extracted data: at the beginning of the file (symbol prepend) or at the end (symbol append). The TARGET can be an absolute path or a path relative to the directory of the note of origin.

In the above example, a file temp.org in the note's directory will be used as a target. If it does not exist, it will be created; if it exists, the org data will be put at the end of the file.
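The resolution rules described above can be sketched in a few lines of Elisp. This is only an illustration of the stated rules, not the code in this PR; the function name my/orb-export-resolve-target and its argument list are assumptions made for the example.

```emacs-lisp
(defun my/orb-export-resolve-target (target note-dir citekey type)
  "Return the file that export data of TYPE would be written to.
TARGET is an absolute path or a path relative to NOTE-DIR.
CITEKEY is the note's citation key; TYPE is a symbol such as
org, bib or txt.  Hypothetical helper, for illustration only."
  ;; `expand-file-name' leaves absolute paths alone and resolves
  ;; relative ones against NOTE-DIR.
  (let ((path (expand-file-name target note-dir)))
    (if (file-directory-p path)
        ;; Existing directory: create CITEKEY.TYPE inside it.
        (expand-file-name (format "%s.%s" citekey type) path)
      ;; Anything else is treated as a file, existing or not.
      path)))
```

Under these assumptions, calling it with "temp.org" and the note's directory yields temp.org inside that directory, while a TARGET naming an existing directory yields a file such as Doe2021.txt inside it.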

The second element of the list, the txt TYPE, declares that the text data should be put both under a headline in the note's buffer and into a new file (say Doe2021.txt) in a directory named "directory" that resides within the note's directory.

The third element of the list, the bib TYPE, declares that the bibtex data should be prepended to a file my-library.bib.

This is a breaking change. Because all elements of the list are optional, no data will be exported at all if the variable orb-pdf-scrapper-export-options is nil (it is not by default). To emulate the previous behaviour of Orb PDF Scrapper, this variable must have the following value:

(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"))))

Although it seems to work for me, this feature is only a preview, and any suggestions are welcome. I will merge it after cleaning up and documenting, and probably some bug hunting.

myshevchuk commented 3 years ago

Probably LOCATION should be called TARGET and vice versa.

j-steinbach commented 3 years ago

It's a lot to wrap your head around, but overall it looks fine. Two things:


What if I want to save in another file under a specific heading? At the moment it only prepends/appends the data.

Wouldn't it make more sense to specify the file (TARGET :question:) as either original or path, where original is the file I started the PDF Scrapper process and path is a file of my choice. And then I can select where in the document I want to save the data. Either a specific heading, or prepend or append.


It likely makes sense to give the user an ability to do something after the extraction. I am too newb for it, but I think it is called hooks?

For example, I would like to look-up the extracted cite-keys and -if they don't already exist- create a new file for each of them.

At the moment it doesn't make that much sense (as I still need to manually fix the reference keys to match my Zotero database and bibtex file), but if that feature also gets rolling, the path to automation :steam_locomotive: and astra :star: is open!

j-steinbach commented 3 years ago

Oh, if everything is optional, wouldn't it make sense to be able to change/define the headline "References (extracted by ORB PDF Scrapper)"?

myshevchuk commented 3 years ago

What if I want to save in another file under a specific heading? At the moment it only prepends/appends the data.

Wouldn't it make more sense to specify the file (TARGET ❓) as either original or path, where original is the file I started the PDF Scrapper process and path is a file of my choice. And then I can select where in the document I want to save the data. Either a specific heading, or prepend or append.

Fully agree. I would keep the existing structure of the list, though. The identifier file could be renamed to path. The list would then look something like this:

(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "org")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (path "temp.org"
               :placement (headline "References (extracted by ORB PDF Scrapper)")))))

So the top-level headline target would imply the buffer of origin, as it does now. The inferior headline target within the path target would specify the heading in that file. The headline placement directive would only take effect when the target file is an Org-mode file.

It likely makes sense to give the user an ability to do something after the extraction. I am too newb for it, but I think it is called hooks?

For example, I would like to look-up the extracted cite-keys and -if they don't already exist- create a new file for each of them.

It's possible to provide some hooks. There are several points in the overall process, where these hooks can be called.

  1. Reference extraction
  2. Conversion to BibTeX
  3. Conversion to Org
  4. Export
  5. Additionally, the whole Scrapper process can be such a point.

Each point can have two associated hooks - before and after. For your purpose, I'd suggest either 4 or 5, e.g. orb-pdf-scrapper-after-export-hook or orb-pdf-scrapper-end-process-hook.
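As a rough sketch of how such a hook could be used if it were added: the hook name below is the one proposed in this comment and is not confirmed to exist in ORB, and my/orb-after-export-report is a hypothetical function that only prints a message. A real function could look up the extracted citation keys and create notes for the missing ones.

```emacs-lisp
(defun my/orb-after-export-report ()
  "Hypothetical hook function: report that the export has finished."
  (message "ORB PDF Scrapper export finished in buffer %s" (buffer-name)))

;; ASSUMPTION: `orb-pdf-scrapper-after-export-hook' is only the name
;; suggested in this thread; the final name may differ or the hook may
;; not exist at all.  `add-hook' itself works on any hook variable.
(add-hook 'orb-pdf-scrapper-after-export-hook #'my/orb-after-export-report)
```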

Oh, if everything is optional, wouldn't it make sense to be able to change/define the headline "References (extracted by ORB PDF Scrapper)"?

Sure, this string is not hardcoded, it's exposed as a part of a user option - just change "References (extracted by ORB PDF Scrapper)" to whatever you like.

myshevchuk commented 3 years ago

I've heavily refactored and hopefully optimized the export code and added a target heading export option as requested. Please check the latest commit in this branch.

Consult the docstring of orb-pdf-scrapper-export-options for a description of the available options. The following example should demonstrate what's possible:

(setq orb-pdf-scrapper-export-options
      '((org  ;; <= TYPE 
         ;;  Export to a heading in the buffer of origin
         (heading "References (extracted by ORB PDF Scrapper)".   
         ;; ^             ^
         ;; TARGET     LOCATION
                     ;; PROPERTIES
                     ;;    v
                     :property-drawer ("PDF_SCRAPPER_TYPE"
                                       "PDF_SCRAPPER_SOURCE"
                                       "PDF_SCRAPPER_DATE")))
        (txt
         ;; Export to a file "references.org"
         (path "references.org"                           
               ;; under a heading "New references"                     
               :placement                                                            
               (heading "New references"
                        :property-drawer ("PDF_SCRAPPER_TYPE"
                                          "PDF_SCRAPPER_SOURCE"
                                          "PDF_SCRAPPER_DATE")
                        ;; Put the new heading in front of other headings
                        :placement prepend)))               
        (bib
         ;; Export to a file in an existing directory.  The file name will be CITEKEY.bib
         (path "/path/to/references-dir/"                            
               :placement prepend
              ;; Include only the references that are not in the target file
              ;; *and* the file(s) specified in bibtex-completion-bibliography
               :filter-bib-entries bibtex-completion-bibliography))))  

I will merge it into master after adding a proper Customize definition and updating the README.

I'd also like to add a little bit more flexibility in heading and file names by allowing for wildcards à la orb-templates. I will also probably elaborate the filtering options slightly. But these things will most probably follow after this branch has been merged into master.

myshevchuk commented 3 years ago

Please also make sure to back up your bib files if they are important to you and if you are going to automatically export the extracted entries to master bib files!

myshevchuk commented 3 years ago

Fix for #151 goes here.

j-steinbach commented 3 years ago

Sorry, but I think I am heavily confusing myself here. Do you want me to test this? And if yes, when do you want me to test this? In scrapper-save or when you put it into master? Your comment "Fix for #151 goes here" is throwing me off.

Also, as I have a separate Zotero database which updated my "Emacs .bib" file, it should be safe for me to corrupt that Emacs .bib file, as I can rebuild from Zotero. Or is there any danger I don't know?

myshevchuk commented 3 years ago

Sorry for the confusion. Although I would appreciate any feedback, this was not a testing request. I had an impression you could be interested in an early adoption of the new functionality into your workflow. The message was: it is usable now but it will take some time until the changes make it into the master, there may also be bugs.

Regarding fix #151, it was more of a memo for myself (as are many other comments here). Fix for #151 is available in this branch and will be available in master after this branch has been merged. There will be no separate fix for the current master.

Also, as I have a separate Zotero database which updated my "Emacs .bib" file, it should be safe for me to corrupt that Emacs .bib file, as I can rebuild from Zotero. Or is there any danger I don't know?

No, it should be fine.

j-steinbach commented 3 years ago

Ok, good to know. You are fine, I just had a heavy case of tunnel-vision coupled with stress and busyness. Also I am not used to "cooperating" on github, so yeah...

I am definitely interested in this feature (in the early adopter sense), but I will wait. (As I am a bit scared of switching branches and getting everything to work again. I need my system to work flawlessly atm..) But as soon as I get it going, you have to brace yourself, as feedback is coming :)

j-steinbach commented 3 years ago

Ok, I got around to "installing" the "scrapper-save" branch, but I am having problems configuring it (again!)...

This is my "literate config" block. I appended the orb-pdf-scrapper-options. Now it says invalid read syntax: ". in wrong context" when I evaluate it (C-c C-c).

There are also a few lines commented out; those don't get recognized by K (describe-function?). As I went around "testing" new and unfinished features, I don't know if they are actually valid functions/variables anymore. I didn't have the time to get through the official documentation and re-configure everything yet again.

#+BEGIN_SRC emacs-lisp :tangle yes
(setq
 orb-pdf-scrapper-refsection-headings '((parent "References")
                                        (in-roam "In Org Roam database" list)
                                        (in-bib "In BibTeX file" list)
                                        (valid "Valid citation keys" list)
                                        (invalid "Invalid citation keys" list))
 orb-pdf-scrapper-table-export-fields '("key" "author" "date")
 orb-autokey-titlewords-ignore '("A" "An" "On" "The" "Eine?" "Der" "Die" "Das" "[^[:upper:]].*" ".*[^[:upper:][:lower:]0-9].*")
 orb-pdf-scrapper-group-references nil
 ;; orb-pdf-scrapper-citekey-format
 orb-pdf-scrapper-set-fields '(("author" orb-pdf-scrapper--invalidate-nil-value)
                               ("title" orb-pdf-scrapper--invalidate-nil-value)
                               ("date" orb-pdf-scrapper--invalidate-nil-value))
 orb-pdf-scrapper-list-style "[%s] "
 ;; orb-pdf-scrapper-reference-numbers "citation-number"
 ;; orb-pdf-scrapper-export-text-references "heading"
 ;; orb-pdf-scrapper-export-bibtex-references "heading"
 orb-pdf-scrapper-export-options
 '((org  ;; <= TYPE
    ;;  Export to a heading in the buffer of origin
    (heading "References (extracted by ORB PDF Scrapper)".
             ;; ^             ^
             ;; TARGET     LOCATION
             ;; PROPERTIES
             ;;    v
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   (txt
    ;; Export to a file "references.org"
    (path "references.org"
          ;; under a heading "New references"
          :placement
          (heading "New references"
                   :property-drawer ("PDF_SCRAPPER_TYPE"
                                     "PDF_SCRAPPER_SOURCE"
                                     "PDF_SCRAPPER_DATE")
                   ;; Put the new heading in front of other headings
                   :placement prepend)))
   (bib
    ;; Export to a file in an existing directory.  The file name will be CITEKEY.bib
    (path "/path/to/references-dir/"
          :placement prepend
          ;; Include only the references that are not in the target file
          ;; *and* the file(s) specified in bibtex-completion-bibliography
          :filter-bib-entries bibtex-completion-bibliography)))
 )
#+END_SRC

My packages.el. I went through a few doom sync, doom sync -u and doom upgrade again..

(package! org-roam-bibtex
  :recipe
  (:host github
   :repo "org-roam/org-roam-bibtex"
   :branch "scrapper-save"
))

j-steinbach commented 3 years ago

Is it possible that there is a typo in (heading "References (extracted by ORB PDF Scrapper)".?

Also now everything gets recognized. I think the trick is either restarting Emacs multiple times or doom compile (which you might have mentioned before)..

j-steinbach commented 3 years ago

E: NVM, the functions don't get recognized as valid ORB-functions (" is a variable without a source file."). They just get recognized because I declared them..

myshevchuk commented 3 years ago

Is it possible that there is a typo in (heading "References (extracted by ORB PDF Scrapper)".

Yes, that must have been my typo or some sort of autocompletion in my OS.

E: NVM, the functions don't get recognized as valid ORB-functions (" is a variable without a source file."). They just get recognized because I declared them..

Until you actually run an ORB PDF Scrapper process. This module is loaded lazily, i.e. it is not loaded together with main ORB functionality but rather after the first call to orb-pdf-scrapper-run or orb-note-actions-scrap-pdf (via orb-note-actions). After the first run, you'll be able to see all the variables and their docstrings.
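If you would rather inspect the variables before the first run, loading the module by hand should have the same effect. A minimal sketch, assuming the module provides the feature orb-pdf-scrapper (the feature name is an assumption based on the command names above):

```emacs-lisp
;; ASSUMPTION: the lazily loaded module provides the feature
;; `orb-pdf-scrapper'.  The third argument (NOERROR) makes `require'
;; return nil instead of signalling an error if it cannot be found.
(if (require 'orb-pdf-scrapper nil t)
    ;; Once loaded, the docstring of the user option is available.
    (describe-variable 'orb-pdf-scrapper-export-options)
  (message "Could not load orb-pdf-scrapper; run the Scrapper once instead"))
```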

j-steinbach commented 3 years ago

Ok, I have been having some "fun" with the whole process. So far I only want to insert all three result buffers into my document. This is my scrapper config (the rest is above):

orb-pdf-scrapper-export-options
 '((org  ;; <= TYPE
    ;;  Export to a heading in the buffer of origin
    (heading "Org-References (extracted by ORB PDF Scrapper)"
             ;; ^             ^
             ;; TARGET     LOCATION
             ;; PROPERTIES
             ;;    v
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   (txt
    (heading "Text-References (extracted by ORB PDF Scrapper)"
             ;; ^             ^
             ;; TARGET     LOCATION
             ;; PROPERTIES
             ;;    v
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   (bib
    ;; Export to a file in an existing directory.  The file name will be CITEKEY.bib
    (heading "Bib-References (extracted by ORB PDF Scrapper)"
             ;; Include only the references that are not in the target file
             ;; *and* the file(s) specified in bibtex-completion-bibliography
             :filter-bib-entries bibtex-completion-bibliography
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   ))

I have two windows open: on the left my note file, on the right something else (my config.org).

I start the process in the left window, in the note file.

When I finish the process "text" and "org" get inserted into my note file. "Bib" is missing.

The process also doesn't close. I get the message "wrong type argument: stringp, nil".

Now the left window shows the process buffer (where I can press C-c C-c as many times as I want to insert more "org" and "text" headings into my note file (without throwing any error)) and the right window shows me my note file.

(I can reproduce the above. If I only use a single window, a new window gets created https://imgur.com/a/WP4bvb2)

Have fun! :imp:


I am not sure if :filter-bib-entries bibtex-completion-bibliography works in the heading process. I would like it to; i.e. show me which keys/items are already in my .bib file.

For my workflow/setup I don't like to directly insert stuff into the .bib file, as this circumvents Zotero. (which is my single source-of-truth) - but I didn't yet check if I can import .bib files into Zotero, so maybe I can "export" the scrapper buffer to a temporary .bib file and insert that file into Zotero.


I also don't understand what the property-drawer does. Do I need it?

myshevchuk commented 3 years ago

The process also doesn't close. I get the message "wrong type argument: stringp, nil".

That's very strange. I could not reproduce the error by copy-pasting your configuration from the two above posts. All three headings are created and the process finishes successfully. Since in your case the process fails at inserting the bib heading, it must have something to do with it. Could you please run it once again with the debugger on, toggle-debug-on-error, and provide a backtrace? Also, what is the value of your bibtex-completion-bibliography?

I am not sure if :filter-bib-entries bibtex-completion-bibliography works in the heading process. I would like it to; i.e. show me which keys/items are already in my .bib file

It does, but not in the way you expect. Only the entries that are not in the bibliography file(s) specified in :filter-bib-entries VARIABLE-OR-FILE-OR-LIST-OF-FILES are inserted. The difference between exporting to a heading and exporting to a bib file is that in the former case only :filter-bib-entries ... is taken into account, while in the latter case the target file is also checked, in addition to any :filter-bib-entries file or files. If the target bib file is among the files specified in :filter-bib-entries, then there is no difference.
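To make the two cases concrete, here is a configuration sketch that uses only the keywords shown earlier in this thread. The file name new-entries.bib is a made-up example, and declaring two targets for one TYPE is assumed to be allowed, as it was in the first version of the option.

```emacs-lisp
(setq orb-pdf-scrapper-export-options
      '((bib
         ;; Heading export: entries are checked only against the files
         ;; named by `bibtex-completion-bibliography'.
         (heading "Bib-References (extracted by ORB PDF Scrapper)"
                  :filter-bib-entries bibtex-completion-bibliography)
         ;; File export: entries are checked against the target file
         ;; itself *and* the files in `bibtex-completion-bibliography'.
         ;; "new-entries.bib" is a hypothetical file name.
         (path "new-entries.bib"
               :placement append
               :filter-bib-entries bibtex-completion-bibliography))))
```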

So currently the keys are silently filtered, but you don't know which ones (unless you configured Org heading export groups, in which case keys in the in-roam and in-bib groups are likely to be the filtered keys). It would be fairly easy to provide a rudimentary report option for the filtered keys. I can think of a simple echo area message, or perhaps, when exporting to a heading, putting the filtered keys separately under that heading, e.g.:

  1. Echo area message

    ORB PDF Scrapper filtered the following keys on bib export: key1, key2, key3

  2. BibTeX entries exported separately under the same heading

  a) As full BibTeX entries:

    * Bib-References (extracted by ORB PDF Scrapper)

    #+name: filtered-entries
    #+begin_src bibtex
    @article{1961-CRV-607,
      citation-number = {1},
      author = {Lichtenthaler, R.W.},
      title = {N/A},
      date = {1961},
      volume = {61},
      pages = {607},
      journal = {Chem. Rev}
    }
    ...
    #+end_src

    #+name: new-entries
    #+begin_src bibtex
    @article{2007-T-10549,
      citation-number = {2},
      author = {Pomeisl, K. and Kvíčala, J. and Paleta, O. and Klásek, A. and Kafka, S. and Kubelka, V. and Havlíček, J. and Čejka, J.},
      title = {N/A},
      date = {2007},
      volume = {63},
      pages = {10549},
      journal = {Tetrahedron}
    }
    ...
    #+end_src

  b) As citation keys:

    * Bib-References (extracted by ORB PDF Scrapper)

    #+name: filtered-entries
    - 1961-CRV-607
    - ...

    #+name: new-entries
    #+begin_src bibtex
    @article{2007-T-10549,
      citation-number = {2},
      author = {Pomeisl, K. and Kvíčala, J. and Paleta, O. and Klásek, A. and Kafka, S. and Kubelka, V. and Havlíček, J. and Čejka, J.},
      title = {N/A},
      date = {2007},
      volume = {63},
      pages = {10549},
      journal = {Tetrahedron}
    }
    ...
    #+end_src

Or maybe you can come up with some other style/option?

but I didn't yet check if I can import .bib files into Zotero

That's possible as far as I remember

also don't understand what the property-drawer does. Do I need it?

It holds some meta information about the extracted data, such as when and from what source the data were extracted. It's currently not very useful and is a sort of placeholder for distant future features. I can vaguely envision manipulating the data under headings created by ORB PDF Scrapper, and a property drawer would greatly help to locate the target headline. But as I said, currently it's not particularly useful for you if you don't see how you can use it :) It's not required for export and can be safely omitted altogether in orb-pdf-scrapper-export-options.

myshevchuk commented 3 years ago

By the way, are you still using the native-comp branch of Emacs?

j-steinbach commented 3 years ago

Yes, I think so. (Is there a command to check the version?)

j-steinbach commented 3 years ago

bibtex-completion-bibliography

bibtex-completion-bibliography is a variable defined in
bibtex-completion.el.

Value
"/home/jst/Gedankenwelt/Wissenschaft/zotero.bib"

Original Value
nil

debug-on-error:


Debugger entered--Lisp error: (wrong-type-argument stringp nil)
  bibtex-valid-entry()
  bibtex-skip-to-valid-entry()
  bibtex-map-entries(#f(compiled-function (key beg end) #<bytecode 0x13c14d8a396c07e8>))
  orb-pdf-scrapper--export-insert-temp-data(bib (:filter-bib-entries bibtex-completion-bibliography :property-drawer ("PDF_SCRAPPER_TYPE" "PDF_SCRAPPER_SOURCE" "PDF_SCRAPPER_DATE")))
  orb-pdf-scrapper--export-to-heading(bib "Bib-References (extracted by ORB PDF Scrapper)" (:filter-bib-entries bibtex-completion-bibliography :property-drawer ("PDF_SCRAPPER_TYPE" "PDF_SCRAPPER_SOURCE" "PDF_SCRAPPER_DATE")))
  orb-pdf-scrapper--export(bib)
  orb-pdf-scrapper--checkout()
  orb-pdf-scrapper-dispatcher()
  funcall-interactively(orb-pdf-scrapper-dispatcher)
  command-execute(orb-pdf-scrapper-dispatcher)

j-steinbach commented 3 years ago

(I will read and comment on the other bib-related stuff after/if we (actually you :angel:) fix the error. I need to fiddle with everything a bit more to form an opinion)

myshevchuk commented 3 years ago

The error should be fixed now. It was caused by an uninitialized global value of bibtex-dialect in your setup (which is perfectly fine in itself). The offending function orb-pdf-scrapper--export-insert-temp-data now sets the dialect locally while the file is being parsed.
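For the curious, the general idea of setting the dialect locally can be sketched as follows. This is not the code from the commit; my/bibtex-keys-in-string is a hypothetical helper, and the choice of the BibTeX dialect (rather than biblatex) is an assumption made for the example.

```emacs-lisp
(require 'bibtex)

(defun my/bibtex-keys-in-string (bibtex-string)
  "Return the keys of the BibTeX entries found in BIBTEX-STRING.
Hypothetical helper, for illustration only."
  (let (keys)
    (with-temp-buffer
      (insert bibtex-string)
      (bibtex-mode)
      ;; In a buffer that visits no file the dialect-dependent
      ;; variables may be left unset, so set the dialect explicitly.
      ;; The non-nil second argument makes the setting buffer-local.
      (bibtex-set-dialect 'BibTeX t)
      ;; `bibtex-map-entries' calls its function with KEY, BEG and END
      ;; for every valid entry in the buffer.
      (bibtex-map-entries
       (lambda (key _beg _end)
         (push key keys))))
    (nreverse keys)))
```

Keeping the setting buffer-local means the user's global configuration, initialized or not, is left untouched.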

myshevchuk commented 3 years ago

also don't understand what the property-drawer does. Do I need it?

It holds some meta information about the extracted data like when and from what source were the data extracted. It's currently not very useful and is a sort of a placeholder for distant future features. I can vaguely envision manipulating the data under headings created by ORB PDF Scrapper, and a property drawer would greatly help to locate the target headline.

A specific example where property drawers will be useful is a possible implementation of feature #142 you requested earlier. If you'd like to restart the process from the data that had been exported to a heading, ORB would need some metadata to locate that heading. A heading name is an unreliable identifier because it is likely to be changed by the user. A property drawer (which will be hard-coded by default should this feature be implemented) holding something like :PDF_SCRAPPER_TYPE: txt is an excellent anchor to bring ORB to the data.
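A small sketch of what locating a heading by its property could look like, using Org's own org-find-property. The wrapper name my/orb-find-scrapper-heading is hypothetical, and nothing like it is claimed to exist in ORB.

```emacs-lisp
(require 'org)

(defun my/orb-find-scrapper-heading (type)
  "Return the position of the first heading exported for TYPE, or nil.
TYPE is a string such as \"txt\".  The heading is found through its
PDF_SCRAPPER_TYPE property, so renaming the heading does not matter.
Hypothetical helper, for illustration only."
  (org-find-property "PDF_SCRAPPER_TYPE" type))

;; Usage in an Org buffer: jump to the heading if there is one.
;; (let ((pos (my/orb-find-scrapper-heading "txt")))
;;   (when pos (goto-char pos)))
```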

j-steinbach commented 3 years ago

It works. Awesome!

myshevchuk commented 3 years ago

Yes, I think so. (Is there a command to check the version?)

I believe checking the value of system-configuration-features can give you a clue. Look for something like NATIVE_COMP there.
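Something along these lines, evaluated with M-: or in the scratch buffer, should do. The second check relies on native-comp-available-p, which only exists in Emacs builds that know about native compilation, hence the fboundp guard.

```emacs-lisp
;; Non-nil if the build was configured with native compilation.
(string-match-p "NATIVE_COMP" system-configuration-features)

;; Alternative check; the function is absent in older Emacs versions.
(and (fboundp 'native-comp-available-p)
     (native-comp-available-p))
```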

I don't use native-comp myself - tried it a couple of times but it was too raw for my daily use. I'll wait until it makes it into Emacs stable. My impression was that this feature may complicate upgrading Emacs packages. On the other hand, Doom supports it and people are using it, so maybe it's not that bad.

j-steinbach commented 3 years ago

Sorry that it took me so long to reply back, but I have a one-track mind and a deadline approaching.. :steam_locomotive:

Now let me wrap my head around everything we talked about again. I do this by means of re-iteration.


This is what I have:

This is what I want:


This is how I currently work:

I create a new note and begin the ORB PDF Scrapper process:

Text mode

BibLaTeX mode

Org mode


:dart: What I want: Keep the number of :repeat: as small as possible, as they are a major waste of time, not very fun and have a high possibility of creating mistakes (manually updating keys is never a good idea).


With this feature/branch, the workflow changes as follows:

BibLaTeX mode

Org mode


As you can see, this is an improvement on two fronts. (:cake: ):

I can still see at least two issues remaining:

I think it boils down to the following: I don't want to deal with my keys after leaving the BibLaTeX mode ever again. At the moment I still have to deal with them after the PDF Scrapper process is finished.

Overall, I believe that PDF Scrapper exists to solve two issues:


This was longer than I intended, and might be missing some features, but I think this captures my motivation behind using ORB and my workflow pretty well. :sweat_smile: I hope this gives a glimpse into what I take this feature for and provides a basis for future discussions on my end.


Or maybe you can come up with some other style/option?

Nothing came up so far yet, will tell if it does :)

myshevchuk commented 3 years ago

Hi, sorry for a long silence. Thanks for your great feedback and for taking your time to write it! My primary computer is still in the service. I will write a detailed response to your above post as soon as I get it back.

myshevchuk commented 3 years ago

Hi, I've finally got my laptop back after almost three weeks in the service and will now move towards merging this pull request into master, hopefully in a couple of days. I feel the export functionality, although quite basic, allowed you to improve your workflow and from the technical point of view is more or less polished for merging. I think only the README section is missing.


As a general remark about repetitions, it's impossible to avoid them all, that is to fully automate the process. The main reason, and from a programmer's point of view the unknown variable, is the initial set of text references generated by Anystyle and presented by ORB in the text buffer.

I sketched the first prototype of what would later become ORB PDF Scrapper in an hour or so. It consisted of several dozen lines and was actually a fully automated piping through Anystyle text and BibTeX outputs, feeding the latter to a simple Elisp function to produce org references directly. But it became immediately obvious that such a thing was unusable because there was no way to correct errors, generate citation keys in the desired format and so on. It's not Anystyle's fault either. That program is doing an amazing job considering how different the initial input can be: good and bad PDF files, different page layouts, and hundreds of different citation styles.

=> There is little that can be done on the ORB's side to improve the extraction of text references from a PDF. It's possible to train a custom Finder model (anystyle command line only, ORB does not implement an interface yet). The default one is reasonably good though. Also, Anystyle is not very well documented in that respect and I'm not very fluent in reading Ruby code to figure it out myself in a reasonable amount of time. Automatically checking text references in Elisp is also not an option apart from the offered basic sanitize text command, because it would mean re-implementing parts of Anystyle or even the external modules such as pdf-to-text that Anystyle relies on. So, I'm afraid we are stuck with editing poorly extracted text references by hand in the foreseeable future. Also, since we are humans and always make mistakes, we are stuck with having to correct them now and then.


So, the next thing I did was implement the current stepwise modal system, where the user (basically me, I guess, at that time) could edit the intermediate text and BibTeX input to their liking and go back to the previous mode in case some things were missing. It turned out for the better, because the BibTeX data are valuable and it's good to be able to have a look at them and possibly store them in the process. Unfortunately, switching between buffers was designed one-way, i.e. you lose the "forward" progress when you go back to the previous mode. I could not clearly see how I could quickly implement a mapping between text and BibTeX, so that if you go back to text mode and edit one reference, you would not need to re-generate all the BibTeX entries, just one entry for that reference.

=> Such a feature would not necessarily decrease the amount of manual editing, but it would definitely make the overall experience more pleasant, or rather psychologically less boring - no more "oh no, I have to re-generate all the data once again". I'm interested in implementing the text <-> BibTeX (<-> org) mapping from the programmer's point of view, but due to the lack of time it's currently not at the top of the priority list. Hopefully, the export functionality brought in by this pull request partially addresses the issue.


The basic idea and my goal at the time I started with the PDF Scrapper was to extract references from a PDF file associated with an org-roam note and put them in that note as org-ref citation keys, so that org-roam automatically connects the current note with any existing notes tagged with the respective keys (#+ROAM_KEY). After having that basic functionality I spent some time writing the autokey feature because the one shipped with Emacs' BibTeX-mode simply did not suit my needs. I also implemented the interface to the anystyle train command, which drastically improves text -> BibTeX conversion by creating a custom Parser model. The default Parser model shipped with Anystyle was performing poorly with citation styles I usually find in chemistry journals.

=> If you haven't used it yet and are having trouble with author/journal/etc fields not being recognized correctly, I urge you to try training a custom Parser model. I'm also keen on improving the existing rather basic autokey functionality, e.g. #147


Now to your comments.

As you can see, this is an improvement on two fronts. (🍰 ):

Great! I'm glad it's working for you and thanks once again for your invaluable input.

I can still see at least two issues remaining:

• "Blind" generation of keys in the original BibLaTeX mode. I am blindly generating keys without being able to cross-reference my global .bib file. This is what I was talking about in #144.
• Having different naming-schemes between Zotero and ORB. See #147.

I think it boils down to the following: I don't want to deal with my keys after leaving the BibLaTeX mode ever again. At the moment I still have to deal with them after the PDF Scrapper process is finished.

I will have to finish the current pull request. Then I'm going to split the README file into a short README and a general manual. Documenting is part of programming, and there have been many changes to ORB over the past months. I'd also like to update the CHANGELOG and prepare a new release. It should not take too long though. Also tell me what feature is more urgent for you, #144 or #147, I'll start working on that one earlier, maybe simultaneously with the manual, but not simultaneously on both features.

Overall, I believe that PDF Scrapper exists to solve two issues:

• Extracting references from a paper. This works very well, almost flawlessly, with some minor hiccups.
• Merging those references into an existing bibliographic database and creating usable cite keys. At the moment this still involves a high amount of manual labor and should be improved further.

Well, as I wrote above the initial idea was simpler, but I don't mind putting it that way :)

j-steinbach commented 3 years ago

Thank you for the write-up - it is very interesting to see how a project evolves :) And congrats on getting your laptop back!


So, I'm afraid we are stuck with editing poorly extracted text references by hand for the foreseeable future. Also, since we are humans and always make mistakes, we are stuck with having to correct them now and then.

I think that it is OK to have manual parts. But they should be non-repeating, i.e. when I am done with them, then I never have to do them again. And based on your other paragraphs, I think you agree :)


I could not clearly see how I could quickly implement a mapping between text and BibTeX, so that if you go back to text mode and edit one reference, you would not need to re-generate all the BibTeX entries, just one entry for that reference.

There are probably thousands of things wrong with my assumption, but why not just treat each line in the "text buffer" as a table row?

text | bibtex | citekey

If the text cell changes, then update the bibtex and citekey cells.


The basic idea and my goal at the time I started with the PDF Scrapper was to extract references from a PDF file associated with an org-roam note and put them in that note as org-ref citation keys, so that org-roam automatically connects the current note with any existing notes tagged with the respective keys (#+ROAM_KEY).

Smart! And similar to my use-case, except that I go one step further and also try to insert the keys into my annotations, so that only the "usable" references remain. Emphasis on "try", as that is lots of work..


If you haven't used it yet and are having trouble with author/journal/etc fields not being recognized correctly, I urge you to try training a custom Parser model

Yay! Another rabbit-hole to burrow down into! :)


I believe that #147 is more impactful (and annoying), so I would go with that. Also I believe that #144 kinda requires #147, as there is not much to look up if the keys differ.


Good idea with the documentation. I believe that I also have some "organically grown" ORB parameters in my config - I believe that updating the docs gives everyone a "fresh start", i.e. I can clean my config and you can focus on new features :)


I got another quick idea: Is it possible to show the "Scrapper modals" side-by-side? (vertical windows)

Text-mode: PDF | Text

BibTeX-mode: Text | BibTeX

Org-modal: BibTeX | Org

Basically we see the buffer we last came from. This makes it possible to look up and fix mistakes in previous buffers.

(BRB requesting)

myshevchuk commented 3 years ago

There are probably thousands of things wrong with my assumption, but why not just treat each line in the "text buffer" as a table row?

text | bibtex | citekey

If the text cell changes, then update the bibtex and citekey cells.

Basically yes. I anticipate more problems on the BibTeX end. A BibTeX entry is not a single line, but a multiline entry with several fields. What if the user has edited the entry, added, removed or changed fields? How should such changes be merged with the changes coming from the updated text entry? Simply overwriting the entry should be easy. Also, BibTeX entries can span a varying number of lines, therefore some sort of position tracking must be used. Ok, that must also be easy. There will definitely be other issues such as what to do when a BibTeX entry was deleted, so some assumptions about the user behavior will have to be made.

also try to insert the keys into my annotations, so that only the "usable" references remain.

Could you please elaborate? What sort of annotations, maybe ORB could assist in automation?

I believe that I also have some "organically grown" ORB parameters in my config - I believe that updating the docs gives everyone a "fresh start", i.e. I can clean my config and you can focus on new features :)

Right, I have many times encountered a situation where a new user would use config examples from some half-year-old blog post, where the information is completely outdated and the package would therefore only throw errors. Here I blame the lack of examples in the README, which prompts people to look for them somewhere else. But even users who faithfully followed the official README several months ago can still experience problems because the package has changed since then. Good documentation would also mean for me that ORB is getting ready to advance to version 1.0.

j-steinbach commented 3 years ago

How should such changes be merged with the changes coming from the updated text entry?

Not sure if I understand this correctly, but I would say that the user always has the "last word", i.e. can overwrite everything.

And I also would say that there is just one direction.

text -> bibtex -> org

If a user changes a bibtex cell, then the changes bubble up to the org cell. Text remains untouched. If a user changes a text cell, then the changes bubble up to the bibtex cell. It changes, which bubbles up to the org cell.
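A minimal Emacs Lisp sketch of that one-way flow, just to make the idea concrete. Everything here is hypothetical and not part of ORB: the `my-ref-row` struct, both setter functions, and the two converter arguments, which stand in for whatever actually turns text into BibTeX (Anystyle) and BibTeX into an Org citation.

```elisp
(require 'cl-lib)

;; Hypothetical row type: one reference in its three representations.
(cl-defstruct (my-ref-row (:constructor my-ref-row-create))
  text bibtex org)

(defun my-ref-row-set-bibtex (row new-bibtex bibtex->org)
  "Store NEW-BIBTEX in ROW and regenerate only the org cell.
The text cell is left untouched.  BIBTEX->ORG is a caller-supplied
function that converts a BibTeX string into an Org citation string."
  (setf (my-ref-row-bibtex row) new-bibtex
        (my-ref-row-org row) (funcall bibtex->org new-bibtex))
  row)

(defun my-ref-row-set-text (row new-text text->bibtex bibtex->org)
  "Store NEW-TEXT in ROW and regenerate the bibtex and org cells.
TEXT->BIBTEX and BIBTEX->ORG are caller-supplied converter functions."
  (setf (my-ref-row-text row) new-text)
  ;; The regenerated BibTeX overwrites the old cell (the user's edit
  ;; to the text has the last word), then propagates to org.
  (my-ref-row-set-bibtex row (funcall text->bibtex new-text) bibtex->org))
```

Editing one text cell then touches exactly one row, so the other entries never need to be re-generated. What this sketch does not answer is how to merge manual edits to a bibtex cell with a later text change; it simply overwrites.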



also try to insert the keys into my annotations, so that only the "usable" references remain.

Could you please elaborate? What sort of annotations, maybe ORB could assist in automation?

It's actually pretty simple. Let's say that I annotated a paper and extracted its contents with org-noter. In my notes I now have some extracted annotations.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer [1] took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset (1960) sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

In this case there are two citations hidden in there (note: usually they follow the same citation style):

Then I use the ORB-Scrapper to extract the references from the paper. This gives me all references from the paper:

References

Now I go and insert those references into my extracted annotations:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer (cite:printer1558jam) took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of cite:letraset1960lorem sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Those are the citations I actually care about - they get linked directly to their source in my Zettelkasten, very similar to how you are doing it.

Finally I delete the references section. If a reference is not cited, then it clearly is not interesting for me.


There are a few problems I am encountering:

(I have a half-finished blog-post about my approach, but I didn't get to finish it, as I am supposed to write my thesis... oh well.)


I actually just yesterday started learning some elisp to maybe automate this - I already have some macros that kinda work, but they are horrifying to manage, so yeah, here I am, again not working on my thesis... :1st_place_medal:

The following takes cite:letraset1960lorem and tries to replace all letraset (1960), Letraset et al (1960), Letraset's (1960), ... It works, but I am stumbling across weird edge cases everywhere. At the moment I just assume that no author wrote two papers in the same year.

At least writing it as a function lets me fix the problems easily. My macros constantly "ring the bell". :anger:

(defun citekey-replace-author-and-year ()
  "Replace \"Author (YEAR)\" mentions before point with this line's cite key.
The current line must contain a key of the form cite:authorYEARtitle."
  (interactive) ; allow this to be user-callable
  (let ((regexp (rx ":"
                    (group (zero-or-more (any alpha "-")))
                    (group (one-or-more digit))
                    (one-or-more alpha)))
        (citekey (buffer-substring-no-properties
                  (line-beginning-position) (line-end-position)))
        (number-matches 0)
        author year searchstring)
    ;; Stop early instead of building a regexp from nil values.
    (unless (string-match regexp citekey)
      (user-error "No cite:authorYEARtitle key on this line"))
    (setq author (match-string 1 citekey)
          year (match-string 2 citekey))
    ;; Always pass a format string: the text may contain "%".
    (message "%s %s" author year)
    ;; `rx' is a macro, so (eval author) cannot see a lexical variable;
    ;; build the regexp at run time with `rx-to-string' instead.
    (setq searchstring
          (rx-to-string `(seq ,author
                              (any space punct)
                              (*? nonl)
                              (zero-or-one "(")
                              ,year
                              (zero-or-one ")"))))
    (save-excursion
      (while (re-search-backward searchstring nil t) ; no bound, no error on failure
        ;; FIXEDCASE and LITERAL: insert the key exactly as written,
        ;; without adapting its case to the matched text.
        (replace-match citekey t t)
        (setq number-matches (1+ number-matches))))
    (message "Found and replaced %s matches" number-matches)))

For the "number" style, it is way simpler. Except if it is written as [1-3]. Good luck solving that with macros.
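In Elisp, at least, expanding such a range is short. A rough sketch, where `my-expand-numeric-citation` is a hypothetical helper (not part of ORB or org-ref) and the accepted input is assumed to be a bracketed, comma-separated list of numbers and hyphenated ranges:

```elisp
(defun my-expand-numeric-citation (citation)
  "Return the list of reference numbers named by CITATION.
CITATION is a bracketed string such as \"[1-3]\" or \"[2, 5-7]\".
Return nil if CITATION does not look like that."
  (when (string-match "\\`\\[\\([0-9, \t-]+\\)\\]\\'" citation)
    (let (numbers)
      ;; Split on commas, dropping empty parts and surrounding blanks.
      (dolist (part (split-string (match-string 1 citation) "," t "[ \t]+"))
        (cond
         ;; A range such as "5-7" expands to every number in between.
         ((string-match "\\`\\([0-9]+\\)[ \t]*-[ \t]*\\([0-9]+\\)\\'" part)
          (setq numbers
                (append numbers
                        (number-sequence
                         (string-to-number (match-string 1 part))
                         (string-to-number (match-string 2 part))))))
         ;; A single number is kept as is.
         ((string-match "\\`[0-9]+\\'" part)
          (setq numbers (append numbers (list (string-to-number part)))))))
      numbers)))
```

Each number in the result can then be looked up in the numbered reference list to find the key to insert. Ranges written with an en dash instead of a hyphen are not handled here.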


Ideally I could detect all references in my annotation body and then cross-reference them with my bibtex. Something like highlighting with occur, then group the same authors, then select a key to replace them with?

This also fixes the problem of references with pre-print release-years. I stumbled across a few references that say "Author 200X", but my key is "Author 200Y". In this case my regex would find nothing.
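Short of that, a cheaper workaround for the year mismatch would be to let the search accept neighbouring years. A sketch under a loud assumption: the pre-print year and the published year differ by at most SLACK years (default 1), which will not always hold. `my-year-window-regexp` is a hypothetical name.

```elisp
(defun my-year-window-regexp (year &optional slack)
  "Return a regexp matching YEAR or any year within SLACK years of it.
YEAR is a string such as \"1960\".  SLACK is an integer, default 1."
  (let* ((slack (or slack 1))
         (center (string-to-number year))
         (years (mapcar #'number-to-string
                        (number-sequence (- center slack)
                                         (+ center slack)))))
    ;; A shy group keeps the alternation self-contained, so the result
    ;; can be concatenated into a larger regexp.
    (concat "\\(?:" (mapconcat #'regexp-quote years "\\|") "\\)")))
```

The returned string could stand in for the literal year in a search regexp. The price is more false positives, e.g. two papers by the same author in consecutive years.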


Yeah, I went a bit off-trail, but I hope the base idea is clear.

myshevchuk commented 3 years ago

It's actually pretty simple. Let's say that I annotated a paper and extracted its contents with org-noter. In my notes I now have some extracted annotations.

Very cool, but indeed it requires a lot of work. After you have finished your thesis (and I mine) we can think about how PDF Scrapper could provide rudimentary support for this, provided you'll still need PDF Scrapper. I envision that PDF Scrapper could become a bridge between PDF-tools, Anystyle and reference management software, a facility to do all references-related stuff.

j-steinbach commented 3 years ago

Sure thing! PDF Scrapper is definitely positioned that way. And I agree to hold our horses and focus on the matters at hand.

myshevchuk commented 3 years ago

This branch is now in master!