ljvmiranda921 / prodigy-pdf-custom-recipe

Custom recipe and utilities for document processing
198 stars, 20 forks

Support for multi-page pdf documents? #8

Open jetsonearth opened 2 years ago

jetsonearth commented 2 years ago

Hi @ljvmiranda921, I came here after reading your beautifully written article, "A framework for designing document processing solutions". Thank you for sharing! I have some PDF documents that I want to perform custom NER on; they include both single-page and multi-page documents.

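I have a few questions:

1. Does your workflow support NER on multi-page documents as well?
2. Would I have to convert all the documents into images first, store them in a directory, and then feed the images into your pipeline for annotation and training?
3. Will I need to split the dataset for training and testing myself, or will Prodi.gy do it for me?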

I just got my Prodi.gy license today and am still learning the tool. Thanks!

ljvmiranda921 commented 2 years ago

Hi @jetsonai11, thanks for dropping by!

Does your workflow support NER on multi-page documents as well?

I haven't tested this on multi-page documents, so I'm not sure how well it would work. It might be easier to split each document into pages and treat every page as its own document. Of course, the difficulty comes when information runs across a page break.

Would I have to convert all the documents into images first, store them in a directory, and then feed the images into your pipeline for annotation and training?

Yes, you need to convert them into images first.
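For reference, here is a rough sketch of that conversion step, assuming the third-party `pdf2image` package (which needs Poppler installed on the system). It isn't part of this repo; the helper names `page_image_path` and `pdf_to_images`, the file-naming scheme, and the DPI value are all made up for illustration. It writes one PNG per page, so multi-page PDFs end up as several single-page images.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir, ext="png"):
    """Build the output path for one page, e.g. images/invoice_p0003.png.

    The naming scheme is an arbitrary choice for this sketch; it keeps the
    source document's name so pages can be grouped by document later.
    """
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_p{page_number:04d}.{ext}"


def pdf_to_images(pdf_path, out_dir, dpi=200):
    """Render every page of a PDF to its own PNG and return the saved paths."""
    # Imported lazily so the path helper above works without pdf2image.
    from pdf2image import convert_from_path  # third-party, requires Poppler

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images

    saved = []
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")
        saved.append(target)
    return saved
```

Calling something like `pdf_to_images("docs/invoice.pdf", "images")` should then give you a directory of page images that you can point the annotation recipe at.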

Will I need to split the dataset for training and testing myself, or will Prodi.gy do it for me?

You need to do the splitting yourself.
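A minimal sketch of one way to do that split, in plain Python. It assumes each example is a dict carrying a `doc_id` key that identifies the source document; that key is hypothetical, so use whatever your own data has. Since multi-page documents get split into pages, this version splits by document rather than by page, so pages from the same PDF don't land in both the training and the test set.

```python
import random
from collections import defaultdict


def split_by_document(examples, test_fraction=0.2, seed=0):
    """Split examples into (train, test) so that no document spans both sets.

    `examples` is a list of dicts with a "doc_id" key (an assumption of this
    sketch, not a Prodigy convention).
    """
    # Group page-level examples by the document they came from.
    by_doc = defaultdict(list)
    for eg in examples:
        by_doc[eg["doc_id"]].append(eg)

    # Shuffle document IDs reproducibly, then hold out a fraction of them.
    doc_ids = sorted(by_doc)
    random.Random(seed).shuffle(doc_ids)
    # At least one document is always held out, so a single-document
    # dataset would leave the training set empty.
    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_ids = set(doc_ids[:n_test])

    train = [eg for d in doc_ids if d not in test_ids for eg in by_doc[d]]
    test = [eg for d in doc_ids if d in test_ids for eg in by_doc[d]]
    return train, test
```

The fixed `seed` keeps the split stable between runs, which helps when you compare models trained at different times.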