jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in Python.
https://borbpdf.com/

No Space between words #24

Closed Mandeep258 closed 3 years ago

Mandeep258 commented 3 years ago

Hi team,

We are trying to use the library to extract text from PDF files, but the extracted text has no spaces between words, so we cannot use it. Is there a way to fix this?

Regards, Mandeep

jorisschellekens commented 3 years ago

Hi there,

Please attach the pdf you're using. Then I can debug the issue.

Kind regards, Joris Schellekens

Mandeep258 commented 3 years ago

I cannot upload the documents because they are classified, so I looked for a public document that shows a similar issue.

https://www.researchgate.net/profile/Tamil-Arasan-Bakthavatchalam/publication/349924975_SymSpell_and_LSTM_based_Spell-_Checkers_for_Tamil/links/6047908f299bf1e07867ba85/SymSpell-and-LSTM-based-Spell-Checkers-for-Tamil.pdf?origin=publication_detail

Also, is there a way to get the number of pages in the document, so that I can iterate over them with the get_text_from_page method?

jorisschellekens commented 3 years ago

I'll have a look at the document.

Are you using my library in a commercial setting?

You can easily get the number of pages from the DocumentInfo object.
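A minimal sketch of that lookup, assuming borb 2.x, where `PDF.loads` returns a document whose `get_document_info()` result exposes `get_number_of_pages()`. The helper names `count_pages` and `load_document` are hypothetical, and the import path may differ between borb versions:

```python
from decimal import Decimal


def count_pages(document) -> int:
    """Return the page count reported by the document's DocumentInfo object."""
    # Assumption: the document exposes get_document_info(), and the returned
    # object exposes get_number_of_pages() (may be a Decimal, or None if unknown).
    number_of_pages = document.get_document_info().get_number_of_pages()
    return int(number_of_pages) if number_of_pages is not None else 0


def load_document(path: str):
    """Load a PDF with borb (hypothetical helper; assumes borb 2.x is installed)."""
    # Imported lazily so count_pages() stays usable without borb installed.
    from borb.pdf.pdf import PDF

    with open(path, "rb") as pdf_file_handle:
        return PDF.loads(pdf_file_handle)
```

With a loaded document, `range(count_pages(doc))` gives the zero-based page indices to pass to a per-page text extraction call.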

Kind regards, Joris Schellekens

Mandeep258 commented 3 years ago

I came across the article https://stackabuse.com/automating-processing-pdf-invoices-in-python-with-borb/ and used only the part about extracting all text, so I thought I would try it. I haven't looked into commercial settings and am not aware of them yet. We use Elasticsearch for keyword search, so our task is to extract text from documents and index it. Initially we went with Apache Tika, but I'm also exploring other libraries that might help. One more request: is it possible to speed up loading a document? It takes a lot of time, and since the work is CPU-bound, using multiple threads becomes difficult.

jorisschellekens commented 3 years ago

I was just wondering why you keep using "we" (plural). It sounds as if you're talking about a group of people (a development team, or company) rather than just yourself.

Mandeep258 commented 3 years ago

Hehe... A development team. We are building a pipeline that goes through crawl, convert, and index. So, "we". :D

jorisschellekens commented 3 years ago

Then you are using borb in a commercial setting. Please make sure you comply with the AGPL3.

Mandeep258 commented 3 years ago

Sure. If the license is not compatible with our use, we will not use it.

jorisschellekens commented 3 years ago

I find it a bit worrying that you were perfectly happy to use my library without checking the license, in a commercial setting.

Mandeep258 commented 3 years ago

I didn't realize that. I apologize. Since most Python libraries are open source, I didn't check the license. I have uninstalled it and will not be using it anymore.

jorisschellekens commented 3 years ago

There is a difference between being open-source and being "free of charge".

This is also clearly mentioned in the README.

You should think of free (in the context of open source at least) as "free speech" rather than "free beer".

The AGPL3 license allows you to use my product only if you yourself are open source to all of your users.

If you prefer not to be open source, or you can't (due to some NDA or confidentiality agreement), you can purchase a commercial license.

But please, do not confuse "open source" with "I am not supposed to support the developer(s) of this product".

Kind regards, Joris Schellekens