lublak / pdfdataextract

Extract data from a pdf with pure javascript
MIT License
25 stars, 5 forks

[Status] Version 4.0 #10

Open lublak opened 10 months ago

lublak commented 10 months ago

Development is slow due to private matters, but the project is still alive. Progress can be followed in the pull request https://github.com/lublak/pdfdataextract/pull/9. This is a very large feature update. It is also very thorough, because every function that pdfjs calls internally has to be analysed and the possible contents of each function reproduced in a structure. This requires reading a lot of the internal source code of the pdfjs library. With this, all data that pdfjs can read from a PDF file becomes conveniently accessible. In addition, all libraries are updated to their latest versions. There is currently a breaking change in this pull request caused by the new version of pdfjs (https://github.com/mozilla/pdf.js/pull/14527). Two options are currently being considered. The first would be to accept a breaking change in this library as well, in version 4.0. The second would be to use this library's own implementation of content extraction. Whether this makes it possible to restore the old behaviour is still uncertain and must be tested once this feature is complete.
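As an illustration of the second possibility only, here is a minimal sketch of a custom extraction step that assembles page text from individual text items, so that the output format is decided by this library rather than by pdfjs. The item shape (`str`, `hasEOL`) and the function name `joinTextItems` are assumptions made for this example; they are not taken from the pdfdataextract code or from the pull request.

```javascript
// Hypothetical helper, not part of pdfdataextract.
// Assumes each item looks like { str: string, hasEOL: boolean }.
function joinTextItems(items) {
  let text = '';
  for (const item of items) {
    // Append the text of the item as-is.
    text += item.str;
    // Emit a line break only where the item marks an end of line.
    if (item.hasEOL) {
      text += '\n';
    }
  }
  return text;
}

// Example input with two lines of text split across three items.
const sample = [
  { str: 'Hello ', hasEOL: false },
  { str: 'world', hasEOL: true },
  { str: 'Second line', hasEOL: false },
];
const result = joinTextItems(sample);
```

Keeping the joining logic in the library itself would mean that a change in how pdfjs groups or separates text would not automatically change the library's output, though whether the old output can be reproduced exactly would still need testing.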

lublak commented 3 months ago

Information about SVG support: the idea was to support SVG directly with the 4.0 update. However, pdfjs has discontinued its SVG support, so I will need my own implementation. I would first like to focus on getting the general functionality of 4.0 ready, so SVG will be added in a 4.x version. When exactly that will be remains to be seen.