Closed Hypercubed closed 2 months ago
I assume you want to benefit from improvements of pdf.js, so anything involving a fork sounds painful. I only manage because I merge the new changes every week.
I wonder if you can implement a system to add plugins to the findbar. That's similar to option 4, but instead of overwriting methods you define user hooks to add custom code. If you're ready to send me such a PR, I'll happily merge it. I guess it's a sensible compromise between risking merge hell and adding flexibility to the library.
BTW, you regex search might be interesting to other users, too. What about adding it to a demo page of the showcase?
Come to think of it, the plugins might bring back fuzzy search and searching multiple terms. I had to abandon these improvements some time ago because I ran into the merge hell. It would be nice to get them back!
That's basically what I want... regex search and multiple terms. I basically only need to add one line in /web/pdf_find_controller.js
but once I do that I'm forced to "fork" pdf_find_controller.js
, pdf_find_utils.js
and ui_utils.js
. I then need to "overwrite" the instance of findController in each page. Here is my code so far:
import type { PDFPageView } from 'ngx-extended-pdf-viewer';
import { PDFFindController } from './pdfjs_overrides/pdf_find_controller';
export function customFindControllerFactor() {
const PDFViewerApplication = (window as any).PDFViewerApplication;
const eventBus = PDFViewerApplication.eventBus;
const linkService = PDFViewerApplication.pdfLinkService;
// Create a custom FindController
const findController = new PDFFindController({ eventBus, linkService } as any);
// Replace findController on each page
eventBus.on('pagesloaded', () => {
findController.setDocument(PDFViewerApplication.pdfDocument);
const pages: PDFPageView[] = (PDFViewerApplication.pdfViewer as any)._pages;
pages.forEach(page => {
(page as any)._textHighlighter.findController = findController;
});
});
return findController;
}
Not very clean or maintainable. If we were to write a plugin system I'm guessing it would be almost the same thing. Something to override the built in PDFFindController
.
Edit: And this to overide the instance in the pdfjs application:
pdfLoaded() {
const findController = customFindControllerFactor();
this.pdfViewerApplication.findController = findController as unknown as FindController;
}
I'm thinking of this approach. Both would be changes to stephanrauh/pdf.js
that would allow ngx-extended-pdf-viewer (or end users) to "plug" the find feature.
/web/pdf_find_controller.js
(in stephanrauh/pdf.js
) to be public (replace this.#method
s with this._method
. If the methods are public it would be easy to create a new class that extends PDFFindController
. The developer extending PDFFindController
doesn't need to override every method since we can call super._method
in the base class.PDFViewerApplication._initializeViewerComponents
to look for constructor overrides for PDFFindController
in AppOptions
(or similar). That would allow end users to replace PDFFindController
with the extended class without jumping through the hoops of iterating over all the pages like I did above.Not quite a plugins... but close. You could resonably do this to other objects as well.
Here is what I would guss my "plugin" would look like:
class CustomPDFFindController extends PDFFindController {
_convertToRegExpString(query, hasDiacritics) {
const { matchRegex } = this._state;
if (matchRegex) return [false, query];
return super._convertToRegExpString(query, hasDiacritics);
}
}
Registered something like:
PDFViewerApplicationOptions.set('FindControllerConstructorOveride', CustomPDFFindController);
Yes, that's an option. You're definitely on the right track. It's just that I'd like to find a minimally invasive approach. And it seems I'm less shy about manipulating prototypes. :)
customQueryConversion(query) {
return query; // idempotent default implementation
}
#calculateMatch(pageIndex) {
let query = this.#query;
if (query.length === 0) {
return; // Do nothing: the matches should be wiped out already.
}
const { caseSensitive, entireWord } = this.#state;
const pageContent = this._pageContents[pageIndex];
const hasDiacritics = this._hasDiacritics[pageIndex];
let isUnicode = false;
// allow users to add their plugins <<<<<<<<<
query = customQueryConversion(query);
// end of the modification <<<<<<<<<<<<<<<
if (typeof query === "string") {
[isUnicode, query] = this.#convertToRegExpString(query, hasDiacritics);
} else {
...
and
(PDFViewerApplication as any).pdfViewer.findController.customQueryConversion = () => { /* put your magic here */ };
How do you like the idea?
That might work in some cases... but in my case I need to conditionally skip #convertToRegExpString
. I'd either need a way to say use only customQueryConversion
in some cases, skiping the #convertToRegExpString
or be able to call #convertToRegExpString
within customQueryConversion
; which is not possible for privates.
If we wanted that it would have to be something like:
customQueryConversion(query) {
return false; // idempotent default implementation
}
#calculateMatch(pageIndex) {
...
customQuery = customQueryConversion(query);
if (customQuery !== false) {
// customQuery is not false, use this
query = customQuery;
} else {
// otherwise do the normal stuff
let isUnicode = false;
if (typeof query === "string") {
...
}
// then do everything else
...
then
(PDFViewerApplication as any).pdfViewer.findController.customQueryConversion = () => {
const { matchRegex } = this.state;
if (!matchRegex) return false;
/* put your magic here */
};
That shouldn't be a problem - customQueryConversion
can return more than one value:
customQueryConversion(query) {
return {
skipDefaultBehavior: false;
query; // idempotent default implementation
}
}
#calculateMatch(pageIndex) {
...
// allow users to add their plugins <<<<<<<<<
{ query, skipDefaultBehavior} = customQueryConversion(query);
if (skipDefaultBehavior) {
// do nothing;
}
else if (typeof query === "string") { // end of the modification <<<<<<<<<<<<<<<
We've sent our messages simultaneously - and we're thinking along similar lines!
I suppose my approach isn't perfect yet. But I hope we can polish it until it's useful to support most (or every?) use-case. Things we still need to consider (or at least I should consider it, I'm not familiar with your application):
FindController
.I also might have some modifications to #calculateRegExpMatch
. Obviously I'll do whatever you think is best for your maintance. I'd be happy if I can just do this. Notice that i'm putting matchRegex
into the findController state. That's easy enough to do when I dispatch "find".
const originalCalculateMatch = findController._calculateMatch; // only works if _calculateMatch is public
findController._calculateMatch = function (query) {
const { matchRegex } = this.state;
if (!matchRegex) return originalCalculateMatch(query);
/* put your magic here */
};
const originalCalculateRegExpMatch = findController._calculateRegExpMatch;
findController._calculateRegExpMatch = function (query) {
const { matchRegex } = this.state;
if (!matchRegex) return originalCalculateRegExpMatch(query);
/* put your magic here */
};
Then the only change would be the search and replace #calculateMatch
-> _calculateMatch
and same for #calculateRegExpMatch
.
Might need a few .bind()
here and there...
FYI.. overriding the find bar is working great. Adding the "matchRegex" buttons is not too difficult:
My override of calculateMatch
is not right... but you get the idea.
Edit:
This is what I need:
const originalConvertToRegExpString = findController._convertToRegExpString.bind(findController); // only works if _convertToRegExpString is public
findController._convertToRegExpString = function (query: any) {
const { matchRegex } = this.state;
if (matchRegex) return [false, query];
return originalConvertToRegExpString(query);
};
@stephanrauh Let me know if the PR to pdfjs works for now. I tested locally and allowed me to override the convertToRegExpString
method as shown. If that works, and once it's merged, I can try to add a demo to ngx-extended-pdf-viewer.
Yes, everything seems to work fine. Thanks!
Do we also need to apply these changes to the non-bleeding edge version?
I already did. I just didn't push my changes yet. Currently, I'm merging the latest changes of pdf.js to the bleeding edge, and like so often, that's a bit painful. :)
Your improvement has landed with version 20.1.0. Now I'm looking forward to your demo.
Enjoy! Stephan
Regarding your code snippet:
const originalConvertToRegExpString = findController._convertToRegExpString.bind(findController); // only works if _convertToRegExpString is public
findController._convertToRegExpString = function (query: any) {
const { matchRegex } = this.state;
if (matchRegex) return [false, query];
return originalConvertToRegExpString(query);
};
I prefer the fat arrow syntax over using function(){}
. It's mostly a matter of taste, but it's a better match to how my brain works. I suppose it also allows you to get rid of the .bind()
bit. That's something that's always confusing me, so take it with a grain of salt, but I think that's one of the reasons why I prefer fat arrows: the closure syntax automatically means we're using variables from the enclosing scope, and more often than not, that's what you expect intuitively.
If you use the fat arrow I don't be able to access this.state
... instead would need to use findController directly. Something like:
const originalConvertToRegExpString = findController._convertToRegExpString.bind(findController); // only works if _convertToRegExpString is public
findController._convertToRegExpString = (query: any) => {
const { matchRegex } = findController.state;
if (matchRegex) return [false, query];
return originalConvertToRegExpString(query);
};
Would still need to bind the original _convertToRegExpString
or else it will also lose the this
binding.
The "proper" way is likely this; to ensure the correct bindings. function
can reference this
(fat arrow cannot) and the this
context of our overide is passed down into the originalConvertToRegExpString
. Messing with methods on classes like this can be tricky.
const originalConvertToRegExpString = findController._convertToRegExpString;
findController._convertToRegExpString = function (query: any) {
const { matchRegex } = this.state;
if (matchRegex) return [false, query];
return originalConvertToRegExpString.call(this, query);
};
Even more proper way is to expose the constructor and let the user extend it using class X extends Y
as I outlined above... that would mean even more changes to pdfjs to maintain.
Edit: Meaning if I had access to PDFFindController
and could swap out the constructor after viewer
is loaded but before it is initialized I could do this:
class CustomPDFFindController extends PDFFindController {
_convertToRegExpString(query) {
const { matchRegex } = this._state;
if (matchRegex) return [false, query];
return super._convertToRegExpString(query);
}
}
I'm not a friend of flame wars - so there's not "proper" way in my book. There are always at least two ways that work. I just wanted to make you think about a simpler implementation, and obviously, I managed. No matter why approach you choose - now I know you thought thoroughly about it!
No flame wars from me... thank you for your work on this project.
Actually, I wanted to say that I don't want to start a flame war. :) And I really enjoy discussing with you!
Same to you!
Working like a charm:
I need to figure out how and where to handle invalid regex then I can see about adding to the demo.
That looks promising!
Hi, thanks for creating this feature. I'm using it to programmatically find several phrases at once.
I'd like for these highlights to remain even after a user begins to do their own manual search over a document. Currently the viewer removes all highlights before highlighting the new word or phrase.
After investigating with the browser dev tools I see that a class name 'highlight' is being added to the span tag surrounding word(s) of the search phrase. CSS is looking for this class to add the coloring.
I suppose one way to solve this issue would be to somehow change the naming of these classes, e.g. call it something different such as 'highlight2', so that the attribute doesn't get removed if/when a user uses the search bar to search for a word/phrase.
Is there a better way to do what I'm describing?
@rmdorsey I would love to find a way to add additional classes when doing the highlight. Since the search allows multiple phases, add an additional unique class for each phase.
Unfortunatly, the highlight feature is so deep in pdfjs this might be difficult. I wonder if it would be easier to use annotations?
@rmdorsey I would love to find a way to add additional classes when doing the highlight. Since the search allows multiple phases, add an additional unique class for each phase.
Unfortunatly, the highlight feature is so deep in pdfjs this might be difficult. I wonder if it would be easier to use annotations?
The search allows for multiple phases? Where can I find an example of that?
Regarding annotations, all I've seen on annotations is here: https://pdfviewer.net/extended-pdf-viewer/export-annotations I'm not sure how I'd know where to begin and end any drawings over the text. Is there another relevant annotation example you could point me to?
Well, the PDFjs API allows for multiple search terms; but the UI doesn't use that In fact later these terms get concatinated into a single regex... which is what will make it hard to manage different classes per term.
I don't really know anything about annotations.
Well, the PDFjs API allows for multiple search terms; but the UI doesn't use that In fact later these terms get concatinated into a single regex... which is what will make it hard to manage different classes per term.
I don't really know anything about annotations.
Interesting. The documentation is difficult to navigate through assuming I've found the right site... Is this it? https://mozilla.github.io/pdf.js/api/ Where can I find information about the multi-search?
I'm using regex already via your solution to conduct multi-search, but I'm curious to try their solution.
I want to tag all multi-search terms with the same class (e.g. "highlight2"). If there's 8 found phrases they'll all receive the same class name "highlight2". This way if the user uses the ui find feature it won't remove "highlight2" class because the ui find feature uses the default "highlight" class.
@rmdorsey If you're interested in annotations, have a look at the PDF specification. Annotations are part of the PDF file and allow for adding interactive elements like forms or special-purpose elements like signatures. They're rendered as an additional layer above the canvas and the text layer.
I'm afraid the regex search isn't documented because it's part of the viewer. The documentation only covers the PDF rendering engine.
I've tested your approach of adding custom classes to the spans of the text layer here: https://pdfviewer.net/extended-pdf-viewer/textlayer. You're right, as soon as the user starts searching, the custom CSS classes are overwritten.
Maybe you could catch the (textLayerRendered)
event, copy the text layer, and add your custom highlights to your copy of the text layer.
Maybe you need to remove the text from the spans. That, in turn, requires you to set the width as an actual attribute before deleting the text.
Possibly adding annotations is the simpler approach. :)
"Maybe you could catch the (textLayerRendered) event, copy the text layer, and add your custom highlights to your copy of the text layer."
Yes - I'm currently writing something to do just that. The biggest issue now is finding phrase(s) being searched for. The PDFs I'm searching sometimes have a word broken up across multiple span tags. This makes finding the phrase very difficult because I have to compare each letter in the pdf document to each letter in the search phrase while stepping through numerous span tags. After it's found a match I then add the classes to the span tags and in some cases have to create a span tag with the class when there's only a partial match within the original span tag.
It's a lot to keep track of!
"Possibly adding annotations is the simpler approach."
I'm not sure... I think that would require me to know the precise location of the search phrase in the document before drawing the highlight. I'm not sure how I would measure for the placement of all the annotations.... Also, based on my experience with another viewer adding annotations covers up the text and prevents the user from being able to copy the phrase from the PDF. I'm not sure if this is the case with your viewer.
Thank you for taking the time to read and reply! Overall your project is great!
At the moment, I'm trying to remember how the find controller works. I suppose it makes sense to provide a programmatic API allowing you to get the actual HTML elements that match. I suppose it also makes sense to allow for custom highlights, and to make them survive user searches. Unfortunately, the find controller of pdf.js works asynchronously, which makes everything more difficult, and it hasn't been written with extensibility in mind.
I see that you opened an issue regarding our discussion. Is there a rough estimate for how long implementation might take?
Not yet. At the moment, I just feel I ought to improve the find controller and the programmatic API and open it for customer hooks. But I don't know yet if that's possible, how to do it, and if it's a good idea in the long run. I'm 90% sure I can make you guys happy, but there's no guarantee, nor is there a timetable.
Regarding the find by regex feature; Is there a way to prevent the viewer from doing an auto jump to the page where the search result is?
Currently not - but I think it should be easy to implement this feature.
Not yet. At the moment, I just feel I ought to improve the find controller and the programmatic API and open it for customer hooks. But I don't know yet if that's possible, how to do it, and if it's a good idea in the long run. I'm 90% sure I can make you guys happy, but there's no guarantee, nor is there a timetable.
Hi Stephan, thanks again for all you're doing. Just checking in to see if there's an update regarding this feature request.
Thanks,
Matt
Not really, but I've finally managed to publish the huge refactoring called version 21. This, in turn, means I can now afford time to your ticket.
You might be interested in the changes I've added in version 21.2.0. It's not precisely what you're looking for, but it already gives you new options.
@stephanrauh That's very cool. Is the customFindController
a seperate instance of PDFjs's FindController
or something maintained in ngx-extended-pdf-viewer?
It's a separate instance. Basically, I simply duplicated the line findController=new PdfFindController()
.
@Hypercubed @rmdorsey I believe I've found a nice solution. Would you like to define your own custom find controller, without being restricted by the pdf.js implementation? As it turns out that's possible. The API of PDFFindController
is suprisingly small. Basically, it only reacts to a few events and it doesn't offer any public API.
So it's easy to write your own find controller. I assume most people prefer to extend or modify the original find controller, and this is also possible. My prototype https://github.com/stephanrauh/extended-pdf-viewer-showcase/commit/aebdf6b8d63f459a036d93d93e40fb6690be97ad shows the idea.
How do you like it?
Mostly just curious. I'm still doing the overide of the methods we exposed and using the existing UI.
Well, that's still possible, and there's nothing wrong with that. What I like about my idea is that it opens the way to a find controller that's written in TypeScript and integrates seamlessly into Angular.
Version 21.3.0 has landed, offering you the option to define your own custom controller.
I'm pretty sure the document needs a lot of improvement. The current version is here: https://pdfviewer.net/extended-pdf-viewer/custom-find.
Enjoy! Stephan
Hi Stephan, thanks for this!
I'm trying to use it now by creating a new angular app and bringing in the newest version of the player.
I've set things up like you have here in your showcase repo: https://github.com/stephanrauh/extended-pdf-viewer-showcase/tree/main/src/app/extended-pdf-viewer/custom-find
Your Angular skills are very advanced... I'm wondering if you can give me any pointers regarding these errors I'm seeing...
I have a need to add regex search to the find toolbar. I think I have a clear understanding of ngx-extended-pdf-viewer custom components; etc. I can create custom components that I need but the search functionality itself is buried deep in pdfjs; specifically private methods in PDFFindController.
Options:
ngx-extended-pdf-viewer
andstephanrauh/pdf.js
as PRs to both projects.stephanrauh/pdf.js
and create custom component on my end to expose it.ngx-extended-pdf-viewer
to allow overriding PDFFindController so that I can maintain my own "fork" of just that class.stephanrauh/pdf.js
to make all private methods public; allowing me to override methods inpdfViewer.findController
on my end.stephanrauh/pdf.js
.I feel that # 4 would be the least impactful to your projects; but obviously will add some complexity merging upstream pdfjs changes. I'm willing an able to do the code changes myself... but would like to know if you have any thoughts on the best way to proceed.
Thank you.