stephanrauh / ngx-extended-pdf-viewer

A full-blown PDF viewer for Angular 16, 17, and beyond
https://pdfviewer.net
Apache License 2.0
474 stars 181 forks source link

Custom Find Controller? #2339

Closed Hypercubed closed 1 month ago

Hypercubed commented 4 months ago

I have a need to add regex search to the find toolbar. I think I have a clear understanding of ngx-extended-pdf-viewer custom components; etc. I can create custom components that I need but the search functionality itself is buried deep in pdfjs; specifically private methods in PDFFindController.

Options:

  1. Add this full functionality to ngx-extended-pdf-viewer and stephanrauh/pdf.js as PRs to both projects.
  2. Add private functionality to stephanrauh/pdf.js and create custom component on my end to expose it.
  3. Modify ngx-extended-pdf-viewer to allow overriding PDFFindController so that I can maintain my own "fork" of just that class.
  4. Modify stephanrauh/pdf.js to make all private methods public; allowing me to override methods in pdfViewer.findController on my end.
  5. Maintain my own fork of stephanrauh/pdf.js.

I feel that # 4 would be the least impactful to your projects; but obviously will add some complexity merging upstream pdfjs changes. I'm willing an able to do the code changes myself... but would like to know if you have any thoughts on the best way to proceed.

Thank you.

stephanrauh commented 4 months ago

I assume you want to benefit from improvements of pdf.js, so anything involving a fork sounds painful. I only manage because I merge the new changes every week.

I wonder if you can implement a system to add plugins to the findbar. That's similar to option 4, but instead of overwriting methods you define user hooks to add custom code. If you're ready to send me such a PR, I'll happily merge it. I guess it's a sensible compromise between risking merge hell and adding flexibility to the library.

BTW, you regex search might be interesting to other users, too. What about adding it to a demo page of the showcase?

stephanrauh commented 4 months ago

Come to think of it, the plugins might bring back fuzzy search and searching multiple terms. I had to abandon these improvements some time ago because I ran into the merge hell. It would be nice to get them back!

Hypercubed commented 4 months ago

That's basically what I want... regex search and multiple terms. I basically only need to add one line in /web/pdf_find_controller.js but once I do that I'm forced to "fork" pdf_find_controller.js, pdf_find_utils.js and ui_utils.js. I then need to "overwrite" the instance of findController in each page. Here is my code so far:

import type { PDFPageView } from 'ngx-extended-pdf-viewer';
import { PDFFindController } from './pdfjs_overrides/pdf_find_controller';

export function customFindControllerFactor() {
  const PDFViewerApplication = (window as any).PDFViewerApplication;
  const eventBus = PDFViewerApplication.eventBus;
  const linkService = PDFViewerApplication.pdfLinkService;

  // Create a custom FindController
  const findController = new PDFFindController({ eventBus, linkService } as any);

   // Replace findController on each page
  eventBus.on('pagesloaded', () => {
    findController.setDocument(PDFViewerApplication.pdfDocument);
    const pages: PDFPageView[] = (PDFViewerApplication.pdfViewer as any)._pages;
    pages.forEach(page => {
      (page as any)._textHighlighter.findController = findController;
    });
  });
  return findController;
}

Not very clean or maintainable. If we were to write a plugin system I'm guessing it would be almost the same thing. Something to override the built in PDFFindController.

Edit: And this to overide the instance in the pdfjs application:

  pdfLoaded() {
    const findController = customFindControllerFactor();
    this.pdfViewerApplication.findController = findController as unknown as FindController;
  }
Hypercubed commented 4 months ago

I'm thinking of this approach. Both would be changes to stephanrauh/pdf.js that would allow ngx-extended-pdf-viewer (or end users) to "plug" the find feature.

  1. Update all methods in /web/pdf_find_controller.js (in stephanrauh/pdf.js) to be public (replace this.#methods with this._method. If the methods are public it would be easy to create a new class that extends PDFFindController. The developer extending PDFFindController doesn't need to override every method since we can call super._method in the base class.
  2. Update PDFViewerApplication._initializeViewerComponents to look for constructor overrides for PDFFindController in AppOptions (or similar). That would allow end users to replace PDFFindController with the extended class without jumping through the hoops of iterating over all the pages like I did above.

Not quite a plugins... but close. You could resonably do this to other objects as well.

Here is what I would guss my "plugin" would look like:

class CustomPDFFindController extends PDFFindController {
  _convertToRegExpString(query, hasDiacritics) {
    const {  matchRegex } = this._state;
    if (matchRegex) return [false, query];
    return super._convertToRegExpString(query, hasDiacritics);
  }
}

Registered something like:

PDFViewerApplicationOptions.set('FindControllerConstructorOveride', CustomPDFFindController);
stephanrauh commented 4 months ago

Yes, that's an option. You're definitely on the right track. It's just that I'd like to find a minimally invasive approach. And it seems I'm less shy about manipulating prototypes. :)

customQueryConversion(query) {
  return query; // idempotent default implementation
}

#calculateMatch(pageIndex) {
   let query = this.#query;
    if (query.length === 0) {
      return; // Do nothing: the matches should be wiped out already.
    }
    const { caseSensitive, entireWord } = this.#state;
    const pageContent = this._pageContents[pageIndex];
    const hasDiacritics = this._hasDiacritics[pageIndex];

    let isUnicode = false;

    // allow users to add their plugins <<<<<<<<<
    query = customQueryConversion(query);
    // end of the modification <<<<<<<<<<<<<<<
    if (typeof query === "string") {
      [isUnicode, query] = this.#convertToRegExpString(query, hasDiacritics);
    } else {
...

and

(PDFViewerApplication as any).pdfViewer.findController.customQueryConversion = () => { /* put your magic here */ };

How do you like the idea?

Hypercubed commented 4 months ago

That might work in some cases... but in my case I need to conditionally skip #convertToRegExpString. I'd either need a way to say use only customQueryConversion in some cases, skiping the #convertToRegExpString or be able to call #convertToRegExpString within customQueryConversion; which is not possible for privates.

Hypercubed commented 4 months ago

If we wanted that it would have to be something like:


customQueryConversion(query) {
  return false; // idempotent default implementation
}

#calculateMatch(pageIndex) {
  ...

  customQuery = customQueryConversion(query);
  if (customQuery !== false) {
    // customQuery is not false, use this
    query = customQuery;
  } else {
    // otherwise do the normal stuff
    let isUnicode = false;
    if (typeof query === "string") {
    ...
  }

  // then do everything else
  ...

then

(PDFViewerApplication as any).pdfViewer.findController.customQueryConversion = () => { 
  const { matchRegex } = this.state;
  if (!matchRegex) return false;
  /* put your magic here */
};
stephanrauh commented 4 months ago

That shouldn't be a problem - customQueryConversion can return more than one value:

customQueryConversion(query) {
  return {
     skipDefaultBehavior: false;
     query; // idempotent default implementation
  }
}

#calculateMatch(pageIndex) {
   ...
    // allow users to add their plugins <<<<<<<<<
    { query,  skipDefaultBehavior} = customQueryConversion(query);
    if (skipDefaultBehavior) {
       // do nothing;
    }     
    else if (typeof query === "string") { // end of the modification <<<<<<<<<<<<<<<
stephanrauh commented 4 months ago

We've sent our messages simultaneously - and we're thinking along similar lines!

stephanrauh commented 4 months ago

I suppose my approach isn't perfect yet. But I hope we can polish it until it's useful to support most (or every?) use-case. Things we still need to consider (or at least I should consider it, I'm not familiar with your application):

Hypercubed commented 4 months ago

I also might have some modifications to #calculateRegExpMatch. Obviously I'll do whatever you think is best for your maintance. I'd be happy if I can just do this. Notice that i'm putting matchRegex into the findController state. That's easy enough to do when I dispatch "find".

const originalCalculateMatch = findController._calculateMatch;  // only works if _calculateMatch is public
findController._calculateMatch = function (query) {
  const { matchRegex } = this.state;
  if (!matchRegex) return originalCalculateMatch(query);
  /* put your magic here */
};

const originalCalculateRegExpMatch = findController._calculateRegExpMatch;
findController._calculateRegExpMatch = function (query) {
  const { matchRegex } = this.state;
  if (!matchRegex) return originalCalculateRegExpMatch(query);
  /* put your magic here */
};

Then the only change would be the search and replace #calculateMatch -> _calculateMatch and same for #calculateRegExpMatch.

Might need a few .bind() here and there...

Hypercubed commented 4 months ago

FYI.. overriding the find bar is working great. Adding the "matchRegex" buttons is not too difficult:

image
Hypercubed commented 4 months ago

My override of calculateMatch is not right... but you get the idea.

Edit:

This is what I need:

    const originalConvertToRegExpString = findController._convertToRegExpString.bind(findController);  // only works if _convertToRegExpString is public
    findController._convertToRegExpString = function (query: any) {
      const { matchRegex } = this.state;
      if (matchRegex) return [false, query];
      return originalConvertToRegExpString(query);
    };
Hypercubed commented 4 months ago

@stephanrauh Let me know if the PR to pdfjs works for now. I tested locally and allowed me to override the convertToRegExpString method as shown. If that works, and once it's merged, I can try to add a demo to ngx-extended-pdf-viewer.

stephanrauh commented 4 months ago

Yes, everything seems to work fine. Thanks!

Hypercubed commented 4 months ago

Do we also need to apply these changes to the non-bleeding edge version?

stephanrauh commented 4 months ago

I already did. I just didn't push my changes yet. Currently, I'm merging the latest changes of pdf.js to the bleeding edge, and like so often, that's a bit painful. :)

stephanrauh commented 4 months ago

Your improvement has landed with version 20.1.0. Now I'm looking forward to your demo.

Enjoy! Stephan

stephanrauh commented 4 months ago

Regarding your code snippet:

    const originalConvertToRegExpString = findController._convertToRegExpString.bind(findController);  // only works if _convertToRegExpString is public
    findController._convertToRegExpString = function (query: any) {
      const { matchRegex } = this.state;
      if (matchRegex) return [false, query];
      return originalConvertToRegExpString(query);
    };

I prefer the fat arrow syntax over using function(){}. It's mostly a matter of taste, but it's a better match to how my brain works. I suppose it also allows you to get rid of the .bind() bit. That's something that's always confusing me, so take it with a grain of salt, but I think that's one of the reasons why I prefer fat arrows: the closure syntax automatically means we're using variables from the enclosing scope, and more often than not, that's what you expect intuitively.

Hypercubed commented 4 months ago

If you use the fat arrow I don't be able to access this.state... instead would need to use findController directly. Something like:

    const originalConvertToRegExpString = findController._convertToRegExpString.bind(findController);  // only works if _convertToRegExpString is public
    findController._convertToRegExpString = (query: any) => {
      const { matchRegex } = findController.state;
      if (matchRegex) return [false, query];
      return originalConvertToRegExpString(query);
    };

Would still need to bind the original _convertToRegExpString or else it will also lose the this binding.

Hypercubed commented 4 months ago

The "proper" way is likely this; to ensure the correct bindings. function can reference this (fat arrow cannot) and the this context of our overide is passed down into the originalConvertToRegExpString. Messing with methods on classes like this can be tricky.

    const originalConvertToRegExpString = findController._convertToRegExpString;
    findController._convertToRegExpString = function (query: any) {
      const { matchRegex } = this.state;
      if (matchRegex) return [false, query];
      return originalConvertToRegExpString.call(this, query);
    };
Hypercubed commented 4 months ago

Even more proper way is to expose the constructor and let the user extend it using class X extends Y as I outlined above... that would mean even more changes to pdfjs to maintain.

Edit: Meaning if I had access to PDFFindController and could swap out the constructor after viewer is loaded but before it is initialized I could do this:

class CustomPDFFindController extends PDFFindController {
  _convertToRegExpString(query) {
    const {  matchRegex } = this._state;
    if (matchRegex) return [false, query];
    return super._convertToRegExpString(query);
  }
}
stephanrauh commented 4 months ago

I'm not a friend of flame wars - so there's not "proper" way in my book. There are always at least two ways that work. I just wanted to make you think about a simpler implementation, and obviously, I managed. No matter why approach you choose - now I know you thought thoroughly about it!

Hypercubed commented 4 months ago

No flame wars from me... thank you for your work on this project.

stephanrauh commented 4 months ago

Actually, I wanted to say that I don't want to start a flame war. :) And I really enjoy discussing with you!

Hypercubed commented 4 months ago

Same to you!

Working like a charm:

image

I need to figure out how and where to handle invalid regex then I can see about adding to the demo.

stephanrauh commented 4 months ago

That looks promising!

rmdorsey commented 2 months ago

Hi, thanks for creating this feature. I'm using it to programmatically find several phrases at once.

I'd like for these highlights to remain even after a user begins to do their own manual search over a document. Currently the viewer removes all highlights before highlighting the new word or phrase.

After investigating with the browser dev tools I see that a class name 'highlight' is being added to the span tag surrounding word(s) of the search phrase. CSS is looking for this class to add the coloring.

I suppose one way to solve this issue would be to somehow change the naming of these classes, e.g. call it something different such as 'highlight2', so that the attribute doesn't get removed if/when a user uses the search bar to search for a word/phrase.

Is there a better way to do what I'm describing?

Hypercubed commented 2 months ago

@rmdorsey I would love to find a way to add additional classes when doing the highlight. Since the search allows multiple phases, add an additional unique class for each phase.

Unfortunatly, the highlight feature is so deep in pdfjs this might be difficult. I wonder if it would be easier to use annotations?

rmdorsey commented 2 months ago

@rmdorsey I would love to find a way to add additional classes when doing the highlight. Since the search allows multiple phases, add an additional unique class for each phase.

Unfortunatly, the highlight feature is so deep in pdfjs this might be difficult. I wonder if it would be easier to use annotations?

The search allows for multiple phases? Where can I find an example of that?

Regarding annotations, all I've seen on annotations is here: https://pdfviewer.net/extended-pdf-viewer/export-annotations I'm not sure how I'd know where to begin and end any drawings over the text. Is there another relevant annotation example you could point me to?

Hypercubed commented 2 months ago

Well, the PDFjs API allows for multiple search terms; but the UI doesn't use that In fact later these terms get concatinated into a single regex... which is what will make it hard to manage different classes per term.

I don't really know anything about annotations.

rmdorsey commented 2 months ago

Well, the PDFjs API allows for multiple search terms; but the UI doesn't use that In fact later these terms get concatinated into a single regex... which is what will make it hard to manage different classes per term.

I don't really know anything about annotations.

Interesting. The documentation is difficult to navigate through assuming I've found the right site... Is this it? https://mozilla.github.io/pdf.js/api/ Where can I find information about the multi-search?

I'm using regex already via your solution to conduct multi-search, but I'm curious to try their solution.

I want to tag all multi-search terms with the same class (e.g. "highlight2"). If there's 8 found phrases they'll all receive the same class name "highlight2". This way if the user uses the ui find feature it won't remove "highlight2" class because the ui find feature uses the default "highlight" class.

stephanrauh commented 2 months ago

@rmdorsey If you're interested in annotations, have a look at the PDF specification. Annotations are part of the PDF file and allow for adding interactive elements like forms or special-purpose elements like signatures. They're rendered as an additional layer above the canvas and the text layer.

I'm afraid the regex search isn't documented because it's part of the viewer. The documentation only covers the PDF rendering engine.

stephanrauh commented 2 months ago

I've tested your approach of adding custom classes to the spans of the text layer here: https://pdfviewer.net/extended-pdf-viewer/textlayer. You're right, as soon as the user starts searching, the custom CSS classes are overwritten.

Maybe you could catch the (textLayerRendered) event, copy the text layer, and add your custom highlights to your copy of the text layer.

Maybe you need to remove the text from the spans. That, in turn, requires you to set the width as an actual attribute before deleting the text.

Possibly adding annotations is the simpler approach. :)

rmdorsey commented 2 months ago

"Maybe you could catch the (textLayerRendered) event, copy the text layer, and add your custom highlights to your copy of the text layer."

Yes - I'm currently writing something to do just that. The biggest issue now is finding phrase(s) being searched for. The PDFs I'm searching sometimes have a word broken up across multiple span tags. This makes finding the phrase very difficult because I have to compare each letter in the pdf document to each letter in the search phrase while stepping through numerous span tags. After it's found a match I then add the classes to the span tags and in some cases have to create a span tag with the class when there's only a partial match within the original span tag.

It's a lot to keep track of!

"Possibly adding annotations is the simpler approach."

I'm not sure... I think that would require me to know the precise location of the search phrase in the document before drawing the highlight. I'm not sure how I would measure for the placement of all the annotations.... Also, based on my experience with another viewer adding annotations covers up the text and prevents the user from being able to copy the phrase from the PDF. I'm not sure if this is the case with your viewer.

Thank you for taking the time to read and reply! Overall your project is great!

stephanrauh commented 2 months ago

At the moment, I'm trying to remember how the find controller works. I suppose it makes sense to provide a programmatic API allowing you to get the actual HTML elements that match. I suppose it also makes sense to allow for custom highlights, and to make them survive user searches. Unfortunately, the find controller of pdf.js works asynchronously, which makes everything more difficult, and it hasn't been written with extensibility in mind.

rmdorsey commented 2 months ago

I see that you opened an issue regarding our discussion. Is there a rough estimate for how long implementation might take?

stephanrauh commented 2 months ago

Not yet. At the moment, I just feel I ought to improve the find controller and the programmatic API and open it for customer hooks. But I don't know yet if that's possible, how to do it, and if it's a good idea in the long run. I'm 90% sure I can make you guys happy, but there's no guarantee, nor is there a timetable.

rmdorsey commented 2 months ago

Regarding the find by regex feature; Is there a way to prevent the viewer from doing an auto jump to the page where the search result is?

stephanrauh commented 2 months ago

Currently not - but I think it should be easy to implement this feature.

rmdorsey commented 1 month ago

Not yet. At the moment, I just feel I ought to improve the find controller and the programmatic API and open it for customer hooks. But I don't know yet if that's possible, how to do it, and if it's a good idea in the long run. I'm 90% sure I can make you guys happy, but there's no guarantee, nor is there a timetable.

Hi Stephan, thanks again for all you're doing. Just checking in to see if there's an update regarding this feature request.
Thanks, Matt

stephanrauh commented 1 month ago

Not really, but I've finally managed to publish the huge refactoring called version 21. This, in turn, means I can now afford time to your ticket.

stephanrauh commented 1 month ago

You might be interested in the changes I've added in version 21.2.0. It's not precisely what you're looking for, but it already gives you new options.

Hypercubed commented 1 month ago

@stephanrauh That's very cool. Is the customFindController a seperate instance of PDFjs's FindController or something maintained in ngx-extended-pdf-viewer?

stephanrauh commented 1 month ago

It's a separate instance. Basically, I simply duplicated the line findController=new PdfFindController().

stephanrauh commented 1 month ago

@Hypercubed @rmdorsey I believe I've found a nice solution. Would you like to define your own custom find controller, without being restricted by the pdf.js implementation? As it turns out that's possible. The API of PDFFindController is suprisingly small. Basically, it only reacts to a few events and it doesn't offer any public API.

So it's easy to write your own find controller. I assume most people prefer to extend or modify the original find controller, and this is also possible. My prototype https://github.com/stephanrauh/extended-pdf-viewer-showcase/commit/aebdf6b8d63f459a036d93d93e40fb6690be97ad shows the idea.

How do you like it?

Hypercubed commented 1 month ago

Mostly just curious. I'm still doing the overide of the methods we exposed and using the existing UI.

stephanrauh commented 1 month ago

Well, that's still possible, and there's nothing wrong with that. What I like about my idea is that it opens the way to a find controller that's written in TypeScript and integrates seamlessly into Angular.

stephanrauh commented 1 month ago

Version 21.3.0 has landed, offering you the option to define your own custom controller.

I'm pretty sure the document needs a lot of improvement. The current version is here: https://pdfviewer.net/extended-pdf-viewer/custom-find.

Enjoy! Stephan

rmdorsey commented 1 month ago

Hi Stephan, thanks for this!

I'm trying to use it now by creating a new angular app and bringing in the newest version of the player.

I've set things up like you have here in your showcase repo: https://github.com/stephanrauh/extended-pdf-viewer-showcase/tree/main/src/app/extended-pdf-viewer/custom-find

Your Angular skills are very advanced... I'm wondering if you can give me any pointers regarding these errors I'm seeing...

image