vedang / pdf-tools

Emacs support library for PDF files.
https://pdftools.wiki
GNU General Public License v3.0

Better forward search #244

Open aikrahguzar opened 1 year ago

aikrahguzar commented 1 year ago

I have been looking to improve the synctex experience with pdf-tools and auctex, especially with my preferred style of writing every paragraph on one line and using visual-line-mode to reflow the text. Fixing backward search turned out not to be too difficult (see #242), but making forward search more accurate seems harder.

Basically, the situation seems to be that although synctex is theoretically capable of providing an accurate column number, it needs the TeX engines to provide this information, which none of them do. So it only provides line-level information. However, a given source line can correspond to multiple lines in the PDF (and vice versa), and in that case synctex returns multiple results for the query, leaving it to editors to somehow choose the best one.

However, pdf-tools only gives access to the first result of a synctex forward search. I don't know C, but I think that is happening here: https://github.com/vedang/pdf-tools/blob/c69e7656a4678fe25afbd29f3503dd19ee7f9896/server/epdfinfo.c#L3188C21-L3188C21 That first result seems to correspond to only one, or occasionally two, of the PDF lines for a given source line, not all of them. Is someone who knows C and wants better forward search willing to do either of the following?

1) Change cmd_synctex_forward_search in epdfinfo.c so that it returns the edges of the bounding box of the whole PDF region corresponding to the source line. My guess (I am not sure) is that this would simply be the union of all the rectangles in the individual search results. Some care would be needed when the paragraph gets split across pages.

2) Probably easier, and backward compatible: add a new function to epdfinfo.c that returns the whole list of search results and expose it to Lisp, so that the region can be determined from Lisp.
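To make the "union of all the rectangles" idea concrete, here is a minimal sketch in Python rather than C. It is not the epdfinfo implementation: the `Box` type, its field names, and `union_by_page` are hypothetical names made up for illustration, and the coordinates are assumed to be page-relative edges with the origin at the top left. Boxes are merged per page, which is one simple way to handle a paragraph split across pages.

```python
from typing import Dict, Iterable, NamedTuple


class Box(NamedTuple):
    """One forward-search result (hypothetical layout, not synctex's own)."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


def union_by_page(boxes: Iterable[Box]) -> Dict[int, Box]:
    """Return the smallest box enclosing all results, separately per page."""
    merged: Dict[int, Box] = {}
    for b in boxes:
        m = merged.get(b.page)
        if m is None:
            # First result seen on this page: it is the union so far.
            merged[b.page] = b
        else:
            # Grow the existing union so that it also covers b.
            merged[b.page] = Box(
                b.page,
                min(m.left, b.left),
                min(m.top, b.top),
                max(m.right, b.right),
                max(m.bottom, b.bottom),
            )
    return merged
```

With two results on page 1 and one on page 2, this yields one enclosing box for each page, so the caller can decide which page to show first.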

With this change in the C code, I think I can use techniques similar to those used for backward search, and those used in pdf-isearch for highlighting, to get word-level accuracy. But without a good bound on the PDF region to search, it is hard to get good enough performance.

aikrahguzar commented 1 year ago

I have managed to implement a heuristic refinement of forward search in https://github.com/aikrahguzar/pdf-tools/commit/e25ae22a93283e61eaf14c9fce614cd571a3f6b3

Since Smith-Waterman-style alignment on a whole page is too slow, this is a regexp-from-hell kind of heuristic, and it is more likely to fail than the one for backward search. The failure modes are:

1) Lots of math.

2) Paragraphs crossing pages in the presence of a fancy header/footer. Plain pages decorated with just page numbers should be fine.

3) Lines consisting just of macros.

In these cases, the heuristic should realize its defeat and fall back to the result provided by synctex, so things should be no worse than before.

However, there is another failure mode: two lines on a PDF page that are nearly identical except for math and text in macros. In that case the first line will get used, and the results will be worse than without the heuristic. What I suggested in the earlier comment can help with this scenario.

aikrahguzar commented 3 days ago

I have now removed the hacky regexp-based version of forward search. It mostly worked, but it sometimes locked up and was just unmanageable. The lockups happened a couple of times on a document I was working on, which motivated me to find a better solution.

I am still C-ignorant, but luckily synctex can also be used as a standalone executable, and its output is pretty simple. So I wrote a parser for the output and implemented a mode that correlates the text around point with the text that corresponds to the edges in the synctex results. It is working well in preliminary testing.
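For readers curious what parsing the standalone output involves, here is a rough sketch in Python. It is not the parser from my fork (that one is in Emacs Lisp). It assumes the `key:value` layout that `synctex view -i line:column:input -o output` prints between the `SyncTeX result begin` and `SyncTeX result end` lines, with each result starting at an `Output:` line. The function name `parse_synctex_view` is made up for illustration, and the field set should be checked against your synctex version.

```python
from typing import Dict, List, Union

# Fields assumed to be numeric in each result record.
NUMERIC_KEYS = {"Page", "x", "y", "h", "v", "W", "H"}

Record = Dict[str, Union[str, int, float]]


def parse_synctex_view(text: str) -> List[Record]:
    """Parse `synctex view` output into one dict per result record."""
    records: List[Record] = []
    current = None
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "SyncTeX result begin":
            inside = True
            continue
        if line == "SyncTeX result end":
            break
        if not inside or ":" not in line:
            continue  # banner lines and anything outside the result block
        key, _, value = line.partition(":")
        if key == "Output":
            # Assumption: every record starts with its Output line.
            current = {"Output": value}
            records.append(current)
        elif current is not None and key in NUMERIC_KEYS:
            current[key] = int(value) if key == "Page" else float(value)
    return records
```

Each record then carries a page number plus the `h`, `v`, `W`, `H` values from which the edges of a rectangle can be derived, so all results for a source line are available instead of only the first one.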

The result is here https://github.com/aikrahguzar/pdf-tools/commit/12e8b07dce9b13d25eed956ec0f0b4a18397941d and can be tried using the default branch of my fork https://github.com/aikrahguzar/pdf-tools

It is very new, so expect rough edges; reports about them are welcome.