vedang / pdf-tools

Emacs support library for PDF files.
https://pdftools.wiki
GNU General Public License v3.0
641 stars 90 forks source link

Backward search with long source lines #242

Open aikrahguzar opened 1 year ago

aikrahguzar commented 1 year ago

Describe the bug While writing latex I like to write a whole paragraph on a single line and use visual-line-mode instead of filling to some column which I find finicky. I notice that backward search often works for the beginning of a paragraph but fails if paragraph is long.

What is the expected behaviour? Backward search finds the word I clicked on.

Additional context Digging into the source I found that it is due to how backward search uses pdf-sync-backward-context-limit. It is used

1) To gather context words around the clicked area. 2) To limit the region of the line which is searched for finding the best alignment with context words. The region is of size (* 6 pdf-sync-backward-context-limit) so search doesn't find the correct word if it more than that many characters after the start of the line.

So one solution is to increase the value of pdf-sync-backward-context-limit however as it is used for both 1 and 2 this causes massive slowdown.

I think pdf-sync-backward-context-limit should be used only for the first part while the second part should used the whole line. The following diff accomplishes this,

diff --git a/lisp/pdf-sync.el b/lisp/pdf-sync.el
index b97b3d0..b58b8d8 100644
--- a/lisp/pdf-sync.el
+++ b/lisp/pdf-sync.el
@@ -329,8 +329,7 @@ point to the correct position."
   (pdf-util-goto-position line column)
   (cl-destructuring-bind (windex chindex words)
       context
-    (let* ((swords (pdf-sync-backward--get-source-context
-                    nil (* 6 pdf-sync-backward-context-limit)))
+    (let* ((swords (pdf-sync-backward--get-source-context))
            (similarity-fn (lambda (text source)
                             (if (if (consp text)
                                     (member source text)
@@ -354,16 +353,12 @@ point to the correct position."
           (goto-char (get-text-property 0 'position word))
           (forward-char chindex))))))

-(defun pdf-sync-backward--get-source-context (&optional position limit)
+(defun pdf-sync-backward--get-source-context (&optional position)
   (save-excursion
     (when position (goto-char position))
     (goto-char (line-beginning-position))
     (let* ((region
             (cond
-             ((eq limit 'line)
-              (cons (line-beginning-position)
-                    (line-end-position)))
-
              ;; Synctex usually jumps to the end macro, in case it
              ;; does not understand the environment.
              ((and (fboundp 'LaTeX-find-matching-begin)
@@ -385,16 +380,9 @@ point to the correct position."
                             (forward-line 0)
                             (point)))
                         (point))))
-             (t (cons (point) (point)))))
+             (t (cons (point) (line-end-position)))))
            (begin (car region))
            (end (cdr region)))
-      (when (numberp limit)
-        (let ((delta (- limit (- end begin))))
-          (when (> delta 0)
-            (setq begin (max (point-min)
-                             (- begin (/ delta 2)))
-                  end (min (point-max)
-                           (+ end (/ delta 2)))))))
       (let ((string (buffer-substring-no-properties begin end)))
         (dotimes (i (length string))
           (put-text-property i (1+ i) 'position (+ begin i) string))
-- 
2.42.1

With this strategy it is possible to set pdf-sync-backward-context-limit to a smaller value than default. I have it set to 16 and everything works fine and backward search find the correct location even if I click on a word which is repeated a lot during the paragraph but even with default value of 64 and lines than are around 1000 characters long, backward search is not noticeably slower than before this change.

aikrahguzar commented 1 year ago

Another thing that is sub-optimal in my opinions is the similarity function used. It gives equal weight to all matching token no matter their length while I think giving more weight to longer words will produce better results. I am using,

(lambda (text source)
  (if (if (consp text)
          (member source text)
        (equal text source))
      (expt (length source) 2)
    (- (expt (length source) 2))))

now and it has improved finding the exact word.