scientist-softserv / derivative_rodeo

An ETL Ecosystem for Derivative Processing.

🎁 Improve word coordinates generator #34

Closed kirkkwang closed 1 year ago

kirkkwang commented 1 year ago

This commit ensures that the word coordinates generator emits unique values. IIIF Print's version of the same generator has been observed to emit the same word coordinates multiple times, which causes the UV to show multiple annotations for the same word at the same place. The change also sets up the text generator to produce a more Tesseract-like text file, with extra spaces omitted for cleaner output, and sets up the ALTO generator so that it no longer emits duplicate word coordinates.
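The cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `normalize_word_coords` is a hypothetical helper name, and the sketch assumes the input is a `coords` hash shaped like the dumps below (word string mapped to an array of `[x, y, width, height]` boxes). It makes no attempt to reproduce the ordering of boxes seen in the "After" output.

```ruby
# Hypothetical helper (not part of derivative_rodeo): clean up a "coords" hash.
# Assumes keys are words that may carry trailing whitespace/newlines and
# values are arrays of [x, y, width, height] boxes.
def normalize_word_coords(coords)
  coords.each_with_object({}) do |(word, boxes), cleaned|
    key = word.strip              # drop trailing spaces and newlines from the word
    next if key.empty?            # ignore whitespace-only entries
    cleaned[key] ||= []
    cleaned[key].concat(boxes)    # merge entries that collapse to the same word
    cleaned[key].uniq!            # keep each box only once
  end
end

# Sample entries taken from the "Before" dump in this PR.
before = {
  "letter \n \n " => [[1087, 195, 155, 51], [1087, 195, 155, 51]],
  "in " => [[906, 195, 48, 51], [599, 631, 51, 51]],
  "in \n \n " => [[1194, 546, 48, 51], [1194, 546, 48, 51]]
}

after = normalize_word_coords(before)
```

Merging on the stripped key is what lets `"in "` and `"in \n \n "` end up as a single `"in"` entry, while `uniq!` removes the repeated boxes that would otherwise produce stacked annotations.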

Text (Before)

```txt
_A FEARFUL ADVENTURE. ‘The Missouri. Republican, in a letter from a Kansas correspondent, has the fol- lowing: “At St. Josephs Tsaw Mr, A. 'T. Gor- man, of New York, who had just come: in from the mountains in such a state of pros- tration and affiiction as could only have been occasioned by such exposure, hard- ship and suffering, as perhaps no other man ever snrvived. din company with a Canadian Frenchman and two Kentucki- ans he-left the country of the Blackfeet Indians last -Fall.:to join. Culverson and party at Fort Pierre and ¢ accompany them to the states. They arrived at Fort Pierre two days after Calverson’s departure, and hurried on after, in the hope of overtaking him. On the third day one of those snow
```

Text (After)

```txt
_A FEARFUL ADVENTURE. ‘The Missouri. Republican, in a letter from a Kansas correspondent, has the fol- lowing: “At St. Josephs Tsaw Mr, A. 'T. Gor- man, of New York, who had just come: in from the mountains in such a state of pros- tration and affiiction as could only have been occasioned by such exposure, hard- ship and suffering, as perhaps no other man ever snrvived. din company with a Canadian Frenchman and two Kentucki- ans he-left the country of the Blackfeet Indians last -Fall.:to join. Culverson and party at Fort Pierre and ¢ accompany them to the states. They arrived at Fort Pierre two days after Calverson’s departure, and hurried on after, in the hope of overtaking him. On the third day one of those snow
```
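Because whitespace differences are hard to see in the two text samples, here is a rough sketch of the kind of normalization the description refers to. It is an assumption about the intent, not the generator's actual code: `squeeze_text` is a hypothetical name, and dropping whitespace-only lines is a choice made for this sketch only.

```ruby
# Hypothetical helper (not part of derivative_rodeo): tidy OCR plain text.
# Collapses runs of spaces/tabs to one space, trims each line, and drops
# lines that are empty after trimming (an assumption of this sketch).
def squeeze_text(text)
  text.lines
      .map { |line| line.gsub(/[ \t]+/, " ").strip }
      .reject(&:empty?)
      .join("\n")
end

raw = "_A  FEARFUL   ADVENTURE. \n \n ‘The   Missouri. "
clean = squeeze_text(raw)
```

With this input, `clean` holds two lines with single spaces between words and no trailing whitespace.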
Word Coords (Before) ```rb {"width"=>1261, "height"=>1744, "coords"=> {"_A "=>[[155, 59, 92, 70]], "FEARFUL "=>[[272, 28, 328, 85]], "ADVENTURE. \n" + " \n" + " \n" + " \n" + " \n" + " \n" + " "=> [[622, 22, 451, 107], [622, 22, 451, 107], [622, 22, 451, 107], [622, 22, 451, 107]], "‘The "=>[[69, 188, 124, 52]], "Missouri. "=>[[225, 191, 239, 53]], "Republican, "=>[[517, 174, 348, 87]], "in "=>[[906, 195, 48, 51], [599, 631, 51, 51], [526, 1587, 52, 50]], "a "=>[[1007, 211, 33, 36], [167, 293, 32, 35], [824, 648, 31, 34]], "letter \n" + " \n" + " "=> [[1087, 195, 155, 51], [1087, 195, 155, 51]], "from "=>[[11, 270, 128, 56], [12, 625, 128, 51]], "Kansas "=>[[229, 277, 211, 52]], "correspondent, "=>[[464, 281, 424, 93]], "has "=>[[903, 281, 105, 106]], "the "=> [[1039, 282, 89, 51], [163, 627, 94, 52], [337, 1155, 88, 50], [832, 1157, 95, 51], [99, 1411, 86, 51], [606, 1587, 87, 53], [327, 1675, 89, 51]], "fol- \n" + " \n" + " "=>[[1149, 279, 89, 53], [1149, 279, 89, 53]], "lowing: \n" + " \n" + " \n" + "\n" + " \n" + " "=> [[12, 361, 212, 68], [12, 361, 212, 68], [12, 361, 212, 68]], "“At "=>[[86, 452, 120, 49]], "St. "=>[[234, 449, 77, 55]], "Josephs "=>[[339, 415, 228, 105]], "Tsaw "=>[[595, 454, 156, 51]], "Mr, "=>[[781, 456, 103, 53]], "A. "=>[[915, 457, 67, 51]], "'T. 
"=>[[1011, 456, 63, 52]], "Gor- \n" + " \n" + " "=> [[1116, 407, 128, 101], [1116, 407, 128, 101]], "man, "=>[[11, 554, 143, 48]], "of "=> [[177, 539, 51, 51], [1043, 633, 54, 50], [724, 1157, 59, 51], [874, 1589, 55, 51], [865, 1676, 57, 51]], "New "=>[[260, 539, 129, 53]], "York, "=>[[417, 542, 163, 64]], "who "=>[[607, 544, 117, 52]], "had "=>[[752, 544, 107, 53]], "just "=>[[861, 546, 127, 67]], "come: "=>[[1012, 562, 158, 36]], "in \n" + " \n" + " \n" + " \n" + " \n" + " \n" + " "=> [[1194, 546, 48, 51], [1194, 546, 48, 51], [1194, 546, 48, 51], [1194, 546, 48, 51]], "mountains "=>[[287, 631, 289, 49]], "such "=>[[678, 632, 124, 51], [608, 807, 140, 52]], "state "=>[[882, 636, 137, 47]], "pros- \n" + " \n" + " \n" + " \n" + " \n" + " \n" + " "=> [[1109, 637, 152, 62], [1109, 637, 152, 62], [1109, 637, 152, 62], [1109, 637, 152, 62]], "tration "=>[[12, 714, 197, 51]], "and "=> [[238, 717, 101, 66], [155, 896, 101, 48], [669, 1071, 105, 51], [647, 1326, 103, 50]], "affiiction "=>[[372, 717, 239, 52]], "as "=>[[645, 734, 56, 35], [590, 913, 59, 34]], "could "=>[[746, 719, 150, 53]], "only "=>[[950, 722, 117, 65]], "have \n" + " \n" + " "=> [[1112, 719, 132, 53], [1112, 719, 132, 53]], "been "=>[[15, 803, 124, 51]], "occasioned "=>[[168, 805, 306, 53]], "by "=>[[514, 808, 68, 65]], "exposure, "=>[[795, 824, 270, 58]], "hard- \n" + " \n" + " "=> [[1074, 808, 171, 51], [1074, 808, 171, 51]], "ship "=>[[11, 892, 114, 68]], "suffering, "=>[[296, 894, 265, 69]], "perhaps "=>[[699, 898, 226, 68]], "no "=>[[974, 913, 68, 36]], "other \n" + " \n" + " "=> [[1090, 898, 154, 50], [1090, 898, 154, 50]], "man "=>[[14, 995, 114, 35]], "ever "=>[[156, 997, 123, 34]], "snrvived. 
"=>[[306, 983, 254, 51]], "din "=>[[635, 984, 68, 50]], "company "=>[[732, 999, 253, 54]], "with "=>[[1042, 984, 119, 65]], "a \n" + " \n" + " "=>[[1214, 1002, 31, 34], [1214, 1002, 31, 34]], "Canadian "=>[[15, 1065, 268, 61]], "Frenchman "=>[[312, 1066, 326, 54]], "two "=>[[804, 1077, 104, 45], [15, 1505, 105, 45]], "Kentucki- \n" + " \n" + " "=> [[961, 1069, 285, 54], [961, 1069, 285, 54]], "ans "=>[[12, 1169, 93, 33]], "he-left "=>[[132, 1153, 178, 52]], "country "=>[[452, 1162, 220, 61]], "Blackfeet \n" + " \n" + " "=> [[972, 1157, 276, 52], [972, 1157, 276, 52]], "Indians "=>[[16, 1239, 212, 72]], "last "=>[[247, 1241, 109, 69]], "-Fall.:to "=>[[366, 1240, 233, 48]], "join. "=>[[623, 1244, 172, 71]], "Culverson "=>[[800, 1243, 300, 67]], "and \n" + " \n" + " "=> [[1140, 1245, 106, 64], [1140, 1245, 106, 64], [1145, 1502, 101, 50], [1145, 1502, 101, 50]], "party "=>[[14, 1309, 153, 79]], "at "=>[[195, 1307, 58, 66], [837, 1420, 57, 45]], "Fort "=>[[282, 1305, 131, 68], [914, 1414, 136, 51]], "Pierre "=>[[443, 1313, 180, 63]], "¢ "=>[[774, 1358, 15, 17]], "accompany "=>[[773, 1313, 316, 80]], "them \n" + " \n" + " "=> [[1109, 1303, 137, 73], [1109, 1303, 137, 73]], "to "=>[[15, 1419, 55, 43]], "states. "=>[[213, 1418, 180, 45]], "They "=>[[432, 1412, 147, 68]], "arrived "=>[[608, 1414, 208, 52]], "Pierre \n" + " \n" + " "=> [[1067, 1414, 178, 52], [1067, 1414, 178, 52]], "days "=>[[147, 1500, 129, 64]], "after "=>[[303, 1497, 133, 53]], "Calverson’s "=>[[466, 1497, 328, 54]], "departure, "=>[[821, 1500, 295, 68]], "hurried "=>[[15, 1587, 207, 51]], "on "=>[[252, 1603, 64, 35]], "after, "=>[[343, 1586, 155, 64]], "hope "=>[[719, 1587, 134, 69]], "overtaking \n" + " \n" + " "=> [[939, 1589, 310, 67], [939, 1589, 310, 67]], "him. 
"=>[[18, 1675, 142, 59]], "On "=>[[214, 1672, 82, 54]], "third "=>[[442, 1675, 142, 52]], "day "=>[[608, 1675, 103, 67]], "one "=>[[740, 1691, 97, 36]], "those "=>[[942, 1675, 148, 52]], "snow \n" + " \n" + " \n" + " \n" + " \n" + " \n" + "\n"=> [[1110, 1692, 138, 35], [1110, 1692, 138, 35], [1110, 1692, 138, 35], [1110, 1692, 138, 35], [1110, 1692, 138, 35], [1110, 1692, 138, 35], [1110, 1692, 138, 35]]}} ```
Word Coords (After) ```rb {"width"=>1261, "height"=>1744, "coords"=> {"_A"=>[[155, 59, 92, 70]], "FEARFUL"=>[[272, 28, 328, 85]], "ADVENTURE."=>[[622, 22, 451, 107]], "‘The"=>[[69, 188, 124, 52]], "Missouri."=>[[225, 191, 239, 53]], "Republican,"=>[[517, 174, 348, 87]], "in"=> [[906, 195, 48, 51], [1194, 546, 48, 51], [599, 631, 51, 51], [526, 1587, 52, 50]], "a"=> [[1007, 211, 33, 36], [167, 293, 32, 35], [824, 648, 31, 34], [1214, 1002, 31, 34]], "letter"=>[[1087, 195, 155, 51]], "from"=>[[11, 270, 128, 56], [12, 625, 128, 51]], "Kansas"=>[[229, 277, 211, 52]], "correspondent,"=>[[464, 281, 424, 93]], "has"=>[[903, 281, 105, 106]], "the"=> [[1039, 282, 89, 51], [163, 627, 94, 52], [337, 1155, 88, 50], [832, 1157, 95, 51], [99, 1411, 86, 51], [606, 1587, 87, 53], [327, 1675, 89, 51]], "fol-"=>[[1149, 279, 89, 53]], "lowing:"=>[[12, 361, 212, 68]], "“At"=>[[86, 452, 120, 49]], "St."=>[[234, 449, 77, 55]], "Josephs"=>[[339, 415, 228, 105]], "Tsaw"=>[[595, 454, 156, 51]], "Mr,"=>[[781, 456, 103, 53]], "A."=>[[915, 457, 67, 51]], "'T."=>[[1011, 456, 63, 52]], "Gor-"=>[[1116, 407, 128, 101]], "man,"=>[[11, 554, 143, 48]], "of"=> [[177, 539, 51, 51], [1043, 633, 54, 50], [724, 1157, 59, 51], [874, 1589, 55, 51], [865, 1676, 57, 51]], "New"=>[[260, 539, 129, 53]], "York,"=>[[417, 542, 163, 64]], "who"=>[[607, 544, 117, 52]], "had"=>[[752, 544, 107, 53]], "just"=>[[861, 546, 127, 67]], "come:"=>[[1012, 562, 158, 36]], "mountains"=>[[287, 631, 289, 49]], "such"=>[[678, 632, 124, 51], [608, 807, 140, 52]], "state"=>[[882, 636, 137, 47]], "pros-"=>[[1109, 637, 152, 62]], "tration"=>[[12, 714, 197, 51]], "and"=> [[238, 717, 101, 66], [155, 896, 101, 48], [669, 1071, 105, 51], [1140, 1245, 106, 64], [647, 1326, 103, 50], [1145, 1502, 101, 50]], "affiiction"=>[[372, 717, 239, 52]], "as"=>[[645, 734, 56, 35], [590, 913, 59, 34]], "could"=>[[746, 719, 150, 53]], "only"=>[[950, 722, 117, 65]], "have"=>[[1112, 719, 132, 53]], "been"=>[[15, 803, 124, 51]], "occasioned"=>[[168, 805, 
306, 53]], "by"=>[[514, 808, 68, 65]], "exposure,"=>[[795, 824, 270, 58]], "hard-"=>[[1074, 808, 171, 51]], "ship"=>[[11, 892, 114, 68]], "suffering,"=>[[296, 894, 265, 69]], "perhaps"=>[[699, 898, 226, 68]], "no"=>[[974, 913, 68, 36]], "other"=>[[1090, 898, 154, 50]], "man"=>[[14, 995, 114, 35]], "ever"=>[[156, 997, 123, 34]], "snrvived."=>[[306, 983, 254, 51]], "din"=>[[635, 984, 68, 50]], "company"=>[[732, 999, 253, 54]], "with"=>[[1042, 984, 119, 65]], "Canadian"=>[[15, 1065, 268, 61]], "Frenchman"=>[[312, 1066, 326, 54]], "two"=>[[804, 1077, 104, 45], [15, 1505, 105, 45]], "Kentucki-"=>[[961, 1069, 285, 54]], "ans"=>[[12, 1169, 93, 33]], "he-left"=>[[132, 1153, 178, 52]], "country"=>[[452, 1162, 220, 61]], "Blackfeet"=>[[972, 1157, 276, 52]], "Indians"=>[[16, 1239, 212, 72]], "last"=>[[247, 1241, 109, 69]], "-Fall.:to"=>[[366, 1240, 233, 48]], "join."=>[[623, 1244, 172, 71]], "Culverson"=>[[800, 1243, 300, 67]], "party"=>[[14, 1309, 153, 79]], "at"=>[[195, 1307, 58, 66], [837, 1420, 57, 45]], "Fort"=>[[282, 1305, 131, 68], [914, 1414, 136, 51]], "Pierre"=>[[443, 1313, 180, 63], [1067, 1414, 178, 52]], "¢"=>[[774, 1358, 15, 17]], "accompany"=>[[773, 1313, 316, 80]], "them"=>[[1109, 1303, 137, 73]], "to"=>[[15, 1419, 55, 43]], "states."=>[[213, 1418, 180, 45]], "They"=>[[432, 1412, 147, 68]], "arrived"=>[[608, 1414, 208, 52]], "days"=>[[147, 1500, 129, 64]], "after"=>[[303, 1497, 133, 53]], "Calverson’s"=>[[466, 1497, 328, 54]], "departure,"=>[[821, 1500, 295, 68]], "hurried"=>[[15, 1587, 207, 51]], "on"=>[[252, 1603, 64, 35]], "after,"=>[[343, 1586, 155, 64]], "hope"=>[[719, 1587, 134, 69]], "overtaking"=>[[939, 1589, 310, 67]], "him."=>[[18, 1675, 142, 59]], "On"=>[[214, 1672, 82, 54]], "third"=>[[442, 1675, 142, 52]], "day"=>[[608, 1675, 103, 67]], "one"=>[[740, 1691, 97, 36]], "those"=>[[942, 1675, 148, 52]], "snow"=>[[1110, 1692, 138, 35]]}} ```