Open julianmack opened 2 years ago
Could you provide the datalist you generated (you may send it by email)?
Hi - thanks very much. I've just sent it now
I have not received the email yet. The file might exceed the email attachment size limit. Could you please upload the file to Google Drive or OneDrive and share the link?
Hi @qiaoliang6, that's strange. I sent it via WeTransfer to qiaoliang6@hikvision.com but the link expires tomorrow - perhaps it went into junk. In any case, I've sent it again just now via SharePoint. Thanks again!
Hi - thanks again for your help. The WeTransfer link is now dead but the SharePoint one should still work - let me know if you have trouble accessing it and I will send it via another method
To update on this, in case anyone else is having this issue: I added a hack to fix it here: https://github.com/AccelexTechnology/DAVAR-Lab-OCR/commit/ce6d9ce766573362e3354ece5718ca722feaa4c4
^ This was only appropriate because it affected a small proportion of my dataset (~1/5000 samples). It will give bad results otherwise, as I am setting the index to an arbitrary (and hence likely incorrect) non-NaN value
I spent a long time looking at one of the samples from FinTabNet that was causing the issue here:
{"pdf/RCL/2017/page_60_95895.png": {"height": 888, "width": 2065, "content_ann": {"bboxes": [[], [1195, 11, 1529, 43], [], [761, 177, 832, 209], [1013, 116, 1146, 209], [1325, 116, 1399, 209], [1608, 116, 1682, 209], [1855, 116, 2002, 209], [0, 233, 337, 273], [], [], [], [], [], [24, 292, 497, 334], [663, 293, 914, 334], [946, 293, 1197, 334], [1229, 293, 1480, 334], [1512, 293, 1763, 334], [1795, 293, 2046, 334], [24, 353, 475, 394], [752, 354, 914, 394], [1066, 354, 1197, 394], [1349, 354, 1480, 394], [1631, 354, 1763, 394], [1915, 354, 2046, 394], [24, 414, 147, 455], [783, 415, 914, 455], [1066, 415, 1197, 455], [1349, 415, 1480, 455], [1631, 415, 1763, 455], [1915, 415, 2046, 455], [0, 475, 326, 516], [915, 512, 917, 516], [], [], [], [], [24, 535, 472, 576], [732, 536, 914, 576], [1036, 536, 1197, 576], [1319, 536, 1480, 576], [1601, 536, 1763, 576], [1885, 536, 2046, 576], [0, 596, 337, 637], [915, 634, 917, 638], [], [], [], [], [24, 656, 497, 697], [752, 657, 914, 697], [1036, 657, 1197, 697], [1319, 657, 1480, 697], [1601, 657, 1763, 697], [1885, 657, 2046, 697], [24, 717, 452, 758], [803, 718, 914, 758], [1106, 718, 1197, 758], [1389, 718, 1480, 758], [1672, 718, 1763, 758], [1935, 718, 2046, 758], [24, 777, 147, 819], [803, 778, 914, 819], [1106, 778, 1197, 819], [1370, 778, 1480, 819], [1672, 778, 1763, 819], [2006, 778, 2046, 819], [0, 839, 83, 879], [663, 839, 914, 879], [946, 839, 1197, 879], [1229, 839, 1480, 879], [1512, 839, 1763, 879], [1795, 839, 2046, 879]], "texts": ["", "Payments due by period", "", "Total", "Less than 1 year", "1-3 years", "3-5 years", "More than 5 years", "Operating Activities:", "", "", "", "", "", "Operating lease obligations(1)", "$241,468", "$29,420", "$44,191", "$22,644", "$145,213", "Interest on long-term debt(2)", "1,275,346", "250,600", "415,000", "292,665", "317,081", "Other(3)", "879,206", "214,444", "282,570", "150,003", "232,189", "Investing Activities:", "0", "", "", "", "", "Ship purchase obligations(4)", 
"10,888,494", "2,368,806", "3,063,165", "4,089,153", "1,367,370", "Financing Activities:", "0", "", "", "", "", "Long-term debt obligations(5)", "7,506,312", "1,185,038", "2,047,882", "2,012,922", "2,260,470", "Capital lease obligations(6)", "33,139", "3,476", "7,210", "8,395", "14,058", "Other(7)", "21,552", "8,868", "11,217", "1,467", "\u2014", "Total", "$20,845,517", "$4,060,652", "$5,871,235", "$6,577,249", "$4,336,381"], "texts_tokens": [[], ["P", "a", "y", "m", "e", "n", "t", "s", " ", "d", "u", "e", " ", "b", "y", " ", "p", "e", "r", "i", "o", "d"], [], ["T", "o", "t", "a", "l"], ["L", "e", "s", "s", " ", "t", "h", "a", "n", " ", "1", " ", "y", "e", "a", "r"], ["1", "-", "3", " ", "y", "e", "a", "r", "s"], ["3", "-", "5", " ", "y", "e", "a", "r", "s"], ["M", "o", "r", "e", " ", "t", "h", "a", "n", " ", "5", " ", "y", "e", "a", "r", "s"], ["O", "p", "e", "r", "a", "t", "i", "n", "g", " ", "A", "c", "t", "i", "v", "i", "t", "i", "e", "s", ":"], [], [], [], [], [], ["O", "p", "e", "r", "a", "t", "i", "n", "g", " ", "l", "e", "a", "s", "e", " ", "o", "b", "l", "i", "g", "a", "t", "i", "o", "n", "s", "<sup>", "(", "1", ")", "</sup>"], ["$", "2", "4", "1", ",", "4", "6", "8"], ["$", "2", "9", ",", "4", "2", "0"], ["$", "4", "4", ",", "1", "9", "1"], ["$", "2", "2", ",", "6", "4", "4"], ["$", "1", "4", "5", ",", "2", "1", "3"], ["I", "n", "t", "e", "r", "e", "s", "t", " ", "o", "n", " ", "l", "o", "n", "g", "-", "t", "e", "r", "m", " ", "d", "e", "b", "t", "<sup>", "(", "2", ")", "</sup>"], ["1", ",", "2", "7", "5", ",", "3", "4", "6"], ["2", "5", "0", ",", "6", "0", "0"], ["4", "1", "5", ",", "0", "0", "0"], ["2", "9", "2", ",", "6", "6", "5"], ["3", "1", "7", ",", "0", "8", "1"], ["O", "t", "h", "e", "r", "<sup>", "(", "3", ")", "</sup>"], ["8", "7", "9", ",", "2", "0", "6"], ["2", "1", "4", ",", "4", "4", "4"], ["2", "8", "2", ",", "5", "7", "0"], ["1", "5", "0", ",", "0", "0", "3"], ["2", "3", "2", ",", "1", "8", "9"], ["I", "n", "v", "e", "s", "t", "i", "n", 
"g", " ", "A", "c", "t", "i", "v", "i", "t", "i", "e", "s", ":"], ["0"], [], [], [], [], ["S", "h", "i", "p", " ", "p", "u", "r", "c", "h", "a", "s", "e", " ", "o", "b", "l", "i", "g", "a", "t", "i", "o", "n", "s", "<sup>", "(", "4", ")", "</sup>"], ["1", "0", ",", "8", "8", "8", ",", "4", "9", "4"], ["2", ",", "3", "6", "8", ",", "8", "0", "6"], ["3", ",", "0", "6", "3", ",", "1", "6", "5"], ["4", ",", "0", "8", "9", ",", "1", "5", "3"], ["1", ",", "3", "6", "7", ",", "3", "7", "0"], ["F", "i", "n", "a", "n", "c", "i", "n", "g", " ", "A", "c", "t", "i", "v", "i", "t", "i", "e", "s", ":"], ["0"], [], [], [], [], ["L", "o", "n", "g", "-", "t", "e", "r", "m", " ", "d", "e", "b", "t", " ", "o", "b", "l", "i", "g", "a", "t", "i", "o", "n", "s", "<sup>", "(", "5", ")", "</sup>"], ["7", ",", "5", "0", "6", ",", "3", "1", "2"], ["1", ",", "1", "8", "5", ",", "0", "3", "8"], ["2", ",", "0", "4", "7", ",", "8", "8", "2"], ["2", ",", "0", "1", "2", ",", "9", "2", "2"], ["2", ",", "2", "6", "0", ",", "4", "7", "0"], ["C", "a", "p", "i", "t", "a", "l", " ", "l", "e", "a", "s", "e", " ", "o", "b", "l", "i", "g", "a", "t", "i", "o", "n", "s", "<sup>", "(", "6", ")", "</sup>"], ["3", "3", ",", "1", "3", "9"], ["3", ",", "4", "7", "6"], ["7", ",", "2", "1", "0"], ["8", ",", "3", "9", "5"], ["1", "4", ",", "0", "5", "8"], ["O", "t", "h", "e", "r", "<sup>", "(", "7", ")", "</sup>"], ["2", "1", ",", "5", "5", "2"], ["8", ",", "8", "6", "8"], ["1", "1", ",", "2", "1", "7"], ["1", ",", "4", "6", "7"], ["\u2014"], ["T", "o", "t", "a", "l"], ["$", "2", "0", ",", "8", "4", "5", ",", "5", "1", "7"], ["$", "4", ",", "0", "6", "0", ",", "6", "5", "2"], ["$", "5", ",", "8", "7", "1", ",", "2", "3", "5"], ["$", "6", ",", "5", "7", "7", ",", "2", "4", "9"], ["$", "4", ",", "3", "3", "6", ",", "3", "8", "1"]], "cells": [[0, 0, 0, 0], [0, 1, 0, 5], [1, 0, 1, 0], [1, 1, 1, 1], [1, 2, 1, 2], [1, 3, 1, 3], [1, 4, 1, 4], [1, 5, 1, 5], [2, 0, 2, 0], [2, 1, 2, 1], [2, 2, 2, 2], [2, 3, 2, 3], [2, 4, 2, 
4], [2, 5, 2, 5], [3, 0, 3, 0], [3, 1, 3, 1], [3, 2, 3, 2], [3, 3, 3, 3], [3, 4, 3, 4], [3, 5, 3, 5], [4, 0, 4, 0], [4, 1, 4, 1], [4, 2, 4, 2], [4, 3, 4, 3], [4, 4, 4, 4], [4, 5, 4, 5], [5, 0, 5, 0], [5, 1, 5, 1], [5, 2, 5, 2], [5, 3, 5, 3], [5, 4, 5, 4], [5, 5, 5, 5], [6, 0, 6, 0], [6, 1, 6, 1], [6, 2, 6, 2], [6, 3, 6, 3], [6, 4, 6, 4], [6, 5, 6, 5], [7, 0, 7, 0], [7, 1, 7, 1], [7, 2, 7, 2], [7, 3, 7, 3], [7, 4, 7, 4], [7, 5, 7, 5], [8, 0, 8, 0], [8, 1, 8, 1], [8, 2, 8, 2], [8, 3, 8, 3], [8, 4, 8, 4], [8, 5, 8, 5], [9, 0, 9, 0], [9, 1, 9, 1], [9, 2, 9, 2], [9, 3, 9, 3], [9, 4, 9, 4], [9, 5, 9, 5], [10, 0, 10, 0], [10, 1, 10, 1], [10, 2, 10, 2], [10, 3, 10, 3], [10, 4, 10, 4], [10, 5, 10, 5], [11, 0, 11, 0], [11, 1, 11, 1], [11, 2, 11, 2], [11, 3, 11, 3], [11, 4, 11, 4], [11, 5, 11, 5], [12, 0, 12, 0], [12, 1, 12, 1], [12, 2, 12, 2], [12, 3, 12, 3], [12, 4, 12, 4], [12, 5, 12, 5]], "labels": [[1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1]]}}}
But I couldn't see anything wrong that would allow me to programmatically filter the bad samples out of the dataset. I think the issue arose in the DavarLoadAnnotations._poly2mask function, but I didn't debug further than this
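In case it helps anyone scanning their own datalist, here is a minimal sketch of a sanity check over one annotation entry. It only assumes the `content_ann` layout shown in the sample above (`bboxes`, `texts`, `cells`). The function name `find_suspicious_cells` and the `min_side` threshold are hypothetical, and none of these conditions is confirmed as the cause of the error. They are just candidates to look at: for instance, the sample above contains two boxes that are only 2x4 pixels ([915, 512, 917, 516] and [915, 634, 917, 638]), which this check would flag.

```python
def find_suspicious_cells(ann, min_side=5):
    """Return (cell_index, reason) pairs for cells that look degenerate.

    `ann` is the value stored under one image path in the datalist.
    `min_side` is an arbitrary pixel threshold (an assumption, tune it).
    """
    content = ann["content_ann"]
    bboxes, texts, cells = content["bboxes"], content["texts"], content["cells"]
    issues = []
    # The per-cell lists should all describe the same number of cells.
    if len({len(bboxes), len(texts), len(cells)}) != 1:
        issues.append((-1, "bboxes/texts/cells differ in length"))
    for i, (box, text) in enumerate(zip(bboxes, texts)):
        if not box:
            # Empty cells have an empty box; text without a box is odd.
            if text:
                issues.append((i, "text without a bounding box"))
            continue
        x1, y1, x2, y2 = box
        if x2 - x1 < min_side or y2 - y1 < min_side:
            issues.append((i, f"tiny box {box} for text {text!r}"))
    return issues


# Tiny hand-made entry reusing two boxes from the sample above.
example = {
    "content_ann": {
        "bboxes": [[], [915, 512, 917, 516], [24, 535, 472, 576]],
        "texts": ["", "0", "Ship purchase obligations(4)"],
        "cells": [[0, 0, 0, 0], [0, 1, 0, 1], [0, 2, 0, 2]],
    }
}
print(find_suspicious_cells(example))
```

Running this over every entry in the datalist and inspecting the flagged samples might show whether the failing ones share a pattern. If they do, the same check could be used to drop them instead of patching the value afterwards.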
Thanks for this open source contribution!
I am seeing the same problem as in issue #37, although I am already using the fix for that issue (commit c85ca3f5b1c00b785ca346882a8983d57287d75f) to generate my datalist.
Specifically, when I train on my own data (the open-source FinTabNet dataset), I see the following errors:
Any help would be much appreciated!