Train on custom data - Githubissues

oliverbj commented 4 years ago

Hi @mawanda-jun

First of all, what an incredible project you have created here! I read the TrainNet research paper and it seems like a very cool idea.

I have a question though - I get mixed results on my own invoices (shipping industry). I was wondering, can I train one of the existing models on my own dataset?

For example, suppose I have annotated a lot of invoices in the format below:

filename                 class   xmin   ymin   xmax   ymax
my_invoice_page1.jpeg    table   193    717    389    790
my_invoice_page2.jpeg    table   220    940    362    997

Will I then be able to re-train one of the models and use it?
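A minimal sketch of how annotations in that layout could be loaded and sanity-checked before any training. It assumes the rows are stored as comma-separated values with the header shown above; the `Box` class and `read_annotations` helper are hypothetical names made up for this illustration and are not part of TableTrainNet. The `class` column is stored as `label` because `class` is a reserved word in Python.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One annotated bounding box (hypothetical helper, not from the project)."""
    filename: str
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def read_annotations(handle):
    """Parse filename,class,xmin,ymin,xmax,ymax rows and reject degenerate boxes."""
    boxes = []
    for row in csv.DictReader(handle):
        box = Box(
            filename=row["filename"],
            label=row["class"],
            xmin=int(row["xmin"]),
            ymin=int(row["ymin"]),
            xmax=int(row["xmax"]),
            ymax=int(row["ymax"]),
        )
        # A usable box must have positive width and height.
        if box.xmin >= box.xmax or box.ymin >= box.ymax:
            raise ValueError(f"degenerate box in {box.filename}: {box}")
        boxes.append(box)
    return boxes


# The two example rows from the post, assumed to be comma-separated on disk.
SAMPLE = """filename,class,xmin,ymin,xmax,ymax
my_invoice_page1.jpeg,table,193,717,389,790
my_invoice_page2.jpeg,table,220,940,362,997
"""

boxes = read_annotations(io.StringIO(SAMPLE))
for box in boxes:
    # Print each file with its label and the box width and height in pixels.
    print(box.filename, box.label, box.xmax - box.xmin, box.ymax - box.ymin)
```

Checking the boxes up front is cheap and catches swapped coordinates before they reach a training run.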

mawanda-jun commented 4 years ago

Hi, and thank you :)

First of all, you can definitely train the network on your own dataset; however, the performance of the model depends mainly on the number of tables you have. Second, this is quite an old project, so I don't know whether the pre-trained models and the TF framework still work. Lastly, it is a proof of concept, so I must warn you that it is not "business ready".

If I were you, I'd take this project as a reference and try to build my own with updated libraries and new data. I'd definitely look at this dataset: https://github.com/doc-analysis/TableBank

I think the steps I followed are quite straightforward; however, the code becomes obsolete very quickly and I'm not maintaining it anymore.

Tell me if you have any other questions.

Have a nice day,
Giovanni

oliverbj commented 4 years ago

I tried using your project "IntelligentOCR" and got it up and running. It actually did a pretty good job; however, it clearly detects tables found in academic papers better than tables in invoices, for example (hence why I want to train it on my own dataset).

I have 2000 invoices - all containing tables, that I wish to train a new model on.

I was thinking something like this:

1. Annotate the 2000 invoices according to the CSV format above (filename, class, xmin, ymin, xmax, ymax)
2. Split the dataset into "training" and "test"
3. Train the model

mawanda-jun commented 4 years ago

Yes, I think it would work just fine. Actually, you can also train the network on my original dataset first and then use the resulting model as a pre-trained starting point for training on your own dataset. You should definitely divide your dataset into training and test. Since you have 2K examples, I'd split it 70%-30% to have a good representation at test time.

Consider also changing the parameters of the blurring of the images, since I think invoices have more "sparse" tables, am I right?
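The 70%-30% split suggested above could be sketched as follows. This is only an illustration: the `split_by_file` helper and the `invoice_NNNN.jpeg` file names are made up, and the split is done per file (an assumption, not something the project prescribes) so that two boxes from the same page never end up on both sides.

```python
import random


def split_by_file(filenames, train_fraction=0.7, seed=0):
    """Shuffle the unique file names reproducibly and cut them into train/test."""
    unique = sorted(set(filenames))  # sort first so the result depends only on the seed
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = round(len(unique) * train_fraction)
    return unique[:cut], unique[cut:]


# Hypothetical file names standing in for the 2000 annotated invoices.
invoices = [f"invoice_{i:04d}.jpeg" for i in range(2000)]
train_files, test_files = split_by_file(invoices)
print(len(train_files), len(test_files))
```

Fixing the seed keeps the test set stable across runs, which makes results from different training attempts comparable.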

oliverbj commented 4 years ago

Thanks for your quick reply! Very much appreciated.

Regarding:

> Consider also changing the parameters of the blurring of the images

What do you mean by this? Where do I find this parameter regarding blurring, and why does it matter?

> since I think invoices have more "sparse" tables, am I right?

You most definitely are! These invoice tables don't have any clear lines separating columns or rows, but are still presented as a "table-like/row-like" list.

mawanda-jun commented 4 years ago

Oh, I'm sorry. I thought I had implemented it, but actually I hadn't entirely. I am referring to this paper, in which the authors transform the images so that a network pre-trained on natural images can adapt to sparse, black-and-white documents with tables: https://www.researchgate.net/publication/320243569_Table_Detection_Using_Deep_Learning

I thought I had implemented it entirely, but I found only a b/w version of this transformation at this line: https://github.com/mawanda-jun/TableTrainNet/blob/6b3cee8ed0250d8cd52b374c76597a70121c398c/dataset/img_to_jpeg.py#L22

I think I didn't upload that change because it would have involved RGB images, which were far too heavy for my poor laptop.

However, to implement it, a few changes are needed: you have to change that function, find every place where the third dimension of the images is involved, and change it from "1" to "3". But there is some work to do, and maybe you're not interested in doing it. :D
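To make the "1" to "3" change concrete, here is a small sketch of what moving from a single-channel to a three-channel image means for the data shape. It does not reproduce the function in img_to_jpeg.py or the paper's transformation; it simply repeats the gray channel three times, using plain nested lists so it runs without extra libraries. The names `to_three_channels` and `shape` are hypothetical, and real code would more likely use NumPy or Pillow.

```python
def to_three_channels(gray):
    """Turn an H x W x 1 image into H x W x 3 by repeating the single channel."""
    return [[[pixel[0]] * 3 for pixel in row] for row in gray]


def shape(image):
    """Return (height, width, channels) of a nested-list image."""
    return (len(image), len(image[0]), len(image[0][0]))


# A tiny 2 x 2 grayscale image with one channel per pixel.
gray = [[[0], [255]], [[128], [64]]]
rgb = to_three_channels(gray)
print(shape(gray), "->", shape(rgb))
```

In an actual implementation the three channels would hold transformed versions of the page rather than copies, but every place that assumes a third dimension of 1 would need the same adjustment.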


mawanda-jun / TableTrainNet

Train on custom data #5