microsoft / OCR-Form-Tools

A set of tools to use in Microsoft Azure Form Recognizer and OCR services.
MIT License

Suggestion: Process Folders in blob storage #137

Open waynebrantley opened 4 years ago

waynebrantley commented 4 years ago

Is your feature request related to a problem? Please describe.
The number of files in a model can be quite large when handling many document types, such as invoices. Having to maintain all the files in one big list with unique file names is difficult. Each company we deal with sends around 7 different types of documents, and for each type of document we need 5 samples. So with only 10 companies we are already managing 350 files!

Describe the solution you'd like
Use files in folders for labeling and model building. At a minimum, that would mean displaying them just like any other root folder item.

Describe alternatives you've considered
Without this, we end up with thousands of files to manage in one big folder.
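A minimal sketch of the folder grouping requested above, assuming blob names use `/` as a virtual folder separator. The function `group_by_folder` and the sample blob names are hypothetical illustrations, not part of the tool.

```python
from collections import defaultdict


def group_by_folder(blob_names):
    """Group flat blob names by virtual folder (the text before the last '/')."""
    folders = defaultdict(list)
    for name in blob_names:
        # rpartition returns ('', '', name) when there is no '/', so root-level
        # files end up under the empty-string folder.
        folder, _, filename = name.rpartition("/")
        folders[folder].append(filename)
    return dict(folders)


# Hypothetical layout: company/document-type/sample file
blobs = [
    "acme/invoice/sample1.pdf",
    "acme/invoice/sample2.pdf",
    "acme/statement/sample1.pdf",
    "globex/invoice/sample1.pdf",
    "readme.txt",
]
grouped = group_by_folder(blobs)
for folder, files in sorted(grouped.items()):
    print(folder or "(root)", files)
```

With 10 companies, 7 document types, and 5 samples each, this kind of grouping would yield 70 folders of 5 files instead of one flat list of 350.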

kunzms commented 4 years ago

waynebrantley, thank you for your suggestion. You can specify a folder path when you create a project. By the way, we have a performance issue when there are too many files in a single blob container; we will fix it in the near future.

waynebrantley commented 4 years ago

@kunzms That would seem to be ok if you were creating many models off the same blob container. However, if there is only one model with many files - we would want to store those files in folders underneath the model folder (which to your point could be a folder itself!)

Since you have a performance issue with too many files in a single container, would making it so you only work in one folder at a time help? Currently we have 350 files to initially form our first model.

xinase commented 4 years ago

@waynebrantley you mentioned you have 350 files. Are they all used in training a model (I hope not), or are they files you need to analyze? We suggest you store the files for training a model in a separate directory.

waynebrantley commented 4 years ago

@xinase they are all in one model.

We open an envelope out of a PO Box. It could be any invoice from any company. See original issue text at the top.

So, what is recommended is that you put in all documents of all types that you are going to receive. So, we put 5 copies of each document type in and build the model.

https://cognitive.uservoice.com/forums/921556-form-recognizer/suggestions/37946212-can-we-train-a-single-model-for-multiple-type-of-f

High-level initial testing shows this works quite nicely. So, yes, 350 files for one training model, and we would expect to have at LEAST double that number of files as we continue to build. According to that post, this is the correct way to do this. (For example, you have a prebuilt 'receipts' model. There are thousands of different receipt formats, so I would imagine you guys did the same thing to build that?)

xinase commented 4 years ago

@waynebrantley thank you for your explanation. There are 3 types of services Form Recognizer (FR) offers: pre-built, train without labels, and train with labels.

Each of them has its own internal design and optimization.

Let's focus on train with labels. Our suggestion is: for each typical type of form, collect 5+ representative files, label the fields you want, and train a model on them.

If 2 invoices from 2 companies share a similar layout and fields, you should label and train them as one model, because such a model can extract the info from both.

If 2 invoices are drastically different, you should put them in 2 different models and label and train them separately.

Currently, as a rule of thumb, accuracy does not improve once you have 30+ training files for a model. You could also experiment with 5, 10, or 15 training files.

350 files to train one model is not what we designed the current algorithm for.

We're improving our algorithm and experimenting with cutting-edge technologies, so in the future things might be different, but right now, when training with labels, 5 to 10 files should generate a pretty good model.

thanks
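The guidance above (one model per layout, a handful of training files each) could be sketched roughly as follows. This assumes each file has already been tagged with a layout key by some other means; `plan_training_sets`, the layout keys, and the file names are hypothetical, and the cap of 10 and minimum of 5 are taken from the numbers mentioned in this comment.

```python
from collections import defaultdict


def plan_training_sets(files, max_per_model=10):
    """Map each layout key to at most `max_per_model` training files.

    `files` is an iterable of (layout_key, filename) pairs.
    """
    by_layout = defaultdict(list)
    for layout, filename in files:
        if len(by_layout[layout]) < max_per_model:
            by_layout[layout].append(filename)
    return dict(by_layout)


# Hypothetical input: 12 samples of one layout, 1 sample of another.
files = [("acme-invoice", f"acme_{i}.pdf") for i in range(12)]
files.append(("globex-invoice", "globex_0.pdf"))

plan = plan_training_sets(files, max_per_model=10)

# Layouts with fewer than 5 samples need more files before training.
too_small = [layout for layout, names in plan.items() if len(names) < 5]
print(too_small)
```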

waynebrantley commented 4 years ago

@xinase I understand what you are saying. However, let's say the mailroom scans in 1000 documents. Somehow I would have to know WHICH model they go with? Have a human presort? That certainly defeats much of the purpose. So, how would you recommend we decide which model to use? Given what you are saying (which is in contradiction to the link I posted), I would have around 100-300 models, which is fine; that is not hard. What is hard is deciding which model to use. Please advise.

xinase commented 4 years ago

We’re working on a feature to solve this exact problem. Basically, we will support composing individual models into a “composed model”, where you could simply call Analyze() on the composed model and it would pick the most appropriate model to do the analyze work.

If it’s okay for you to share such data, we’d be happy to run some tests with the data you provide.
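To illustrate the idea of a composed model picking the most appropriate component, here is a toy sketch. It is not how the service implements composition, which is not described in this thread. It assumes each component model can be called on a document and returns a confidence score. `analyze_composed`, `make_stub`, and the model names are all hypothetical.

```python
def analyze_composed(models, document):
    """Run every component model and keep the highest-confidence result."""
    best_id, best_result = None, None
    for model_id, analyze in models.items():
        result = analyze(document)
        if best_result is None or result["confidence"] > best_result["confidence"]:
            best_id, best_result = model_id, result
    return best_id, best_result


def make_stub(keyword):
    """Build a stand-in 'model' that scores a document by keyword hits."""
    def analyze(document):
        hits = document.lower().count(keyword)
        return {"confidence": min(1.0, hits / 2), "fields": {}}
    return analyze


models = {
    "acme-invoice": make_stub("acme"),
    "globex-invoice": make_stub("globex"),
}
chosen, result = analyze_composed(models, "ACME Corp invoice. Remit to ACME Corp.")
print(chosen, result["confidence"])
```

The caller only sees one entry point, which is the property that removes the need for a human presort.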

waynebrantley commented 4 years ago

@xinase Yes, I can provide some data for this as long as I get to try the preview support first! :-) Tell me how to contact you and we will set it up.

xinase commented 4 years ago

My email: xinz_at_microsoft.com. Let's talk.

RGarrett88 commented 4 years ago

I'm currently working on designing a solution for my company that would potentially need to scale to 11,000+ models. I'm trying to figure out if I can get the tagging software to integrate with our software and manage that many sanely. If you do get the composed model working, please let me know, as that was another issue I was probably going to need to handle soon. Our current plan is to select the model based on the OCR-ed mailing address, but I'm worried about its accuracy.
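One way to soften the OCR accuracy concern is fuzzy matching with a cutoff, falling back to manual review when nothing matches well. The sketch below uses Python's standard `difflib.get_close_matches`; the `pick_model` helper, the addresses, the model IDs, and the 0.8 cutoff are hypothetical and would need tuning on real data.

```python
import difflib


def normalize(address):
    """Lowercase, drop commas, and collapse whitespace."""
    return " ".join(address.lower().replace(",", " ").split())


def pick_model(ocr_address, address_to_model, cutoff=0.8):
    """Return the model ID for the closest known address, or None if no
    known address is similar enough (route those documents to a human)."""
    known = {normalize(addr): model for addr, model in address_to_model.items()}
    matches = difflib.get_close_matches(
        normalize(ocr_address), list(known), n=1, cutoff=cutoff
    )
    return known[matches[0]] if matches else None


# Hypothetical lookup table from sender address to model ID.
address_to_model = {
    "100 Main Street, Springfield": "model-acme",
    "42 Harbor Road, Portland": "model-globex",
}

# OCR misread the zeros as the letter O; the fuzzy match still resolves it.
print(pick_model("1OO Main Street Springfield", address_to_model))
print(pick_model("PO Box 9, Nowhere", address_to_model))
```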

waynebrantley commented 4 years ago

It is coming, as we are in the same boat. Previews of it should be available in the next couple of months.

AdamMomen commented 3 years ago

@waynebrantley @xinase any updates on this? I have a very similar issue. I am trying to build a model with 9k+ files. I tried loading all of them at once, but there were performance issues. If I split the files into groups and trained each group individually, would I be able to merge the models together?

waynebrantley commented 3 years ago

@AdamMomen If I understand correctly, training with 9k files for one model would be overkill. They recommend 5ish files. @xinase ?