jrmuizel / pdf-extract

A Rust library for extracting content from PDFs

Multiple panics on Arxiv.org PDFs #75

Open jlandahl opened 7 months ago

jlandahl commented 7 months ago

I'm attempting to extract the text from multiple PDFs from arxiv.org, and 15 out of the 20 I just attempted resulted in panics, many (but not all) apparently Unicode-related. Here are the links to the PDFs that failed:

Here are some of the errors:

For http://arxiv.org/pdf/2312.00064v1:

Unicode mismatch true fl "fl" Ok("fl") [64258]
Unicode mismatch true fi "fi" Ok("fi") [64257]
Unicode mismatch true fl "fl" Ok("fl") [64258]
thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 16 in map {60: "\u{f8f2}", 208: "Γ", 218: "Ω", 65: "\u{f8f8}", 217: "Ψ", 210: "Θ", 213: "Π", 63: "\u{f8e6}", 50: "\u{f8ee}", 160: " ", 57: "\u{f8fc}", 64: "\u{f8ed}", 212: "Ξ", 55: "\u{f8fa}", 209: "∆", 66: "\u{f8ec}", 49: "\u{f8f6}", 59: "\u{f8fe}", 48: "\u{f8eb}", 67: "\u{f8f7}", 51: "\u{f8f9}", 61: "\u{f8fd}", 52: "\u{f8f0}", 62: "\u{f8f4}", 211: "Λ", 159: "√", 53: "\u{f8fb}", 215: "Υ", 58: "\u{f8f3}", 214: "Σ", 54: "\u{f8ef}", 56: "\u{f8f1}", 216: "Φ"} for <</Type /Font/Subtype /Type1/BaseFont /VSLKGG+CMEX10/FirstChar 0/FontDescriptor 4273 0 R/LastChar 125/ToUnicode 4304 0 R/Widths 4259 0 R>>

For http://arxiv.org/pdf/2312.00140v1:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 0 in map {50: "\u{f8ee}", 54: "\u{f8ef}", 67: "\u{f8f7}", 53: "\u{f8fb}", 48: "\u{f8eb}", 160: " ", 63: "\u{f8e6}", 215: "Υ", 61: "\u{f8fd}", 214: "Σ", 57: "\u{f8fc}", 66: "\u{f8ec}", 60: "\u{f8f2}", 64: "\u{f8ed}", 209: "∆", 65: "\u{f8f8}", 208: "Γ", 218: "Ω", 159: "√", 213: "Π", 211: "Λ", 49: "\u{f8f6}", 212: "Ξ", 58: "\u{f8f3}", 56: "\u{f8f1}", 51: "\u{f8f9}", 62: "\u{f8f4}", 210: "Θ", 217: "Ψ", 52: "\u{f8f0}", 55: "\u{f8fa}", 216: "Φ", 59: "\u{f8fe}"} for <</Type /Font/Subtype /Type1/BaseFont /BJKPRR+CMEX10/FirstChar 0/FontDescriptor 1313 0 R/LastChar 88/ToUnicode 1374 0 R/Widths 1287 0 R>>

For http://arxiv.org/pdf/2309.02511v2:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 44 in map {43: "⇁", 165: "Ξ", 13: "γ", 91: "♭", 184: "λ", 46: "▷", 74: "J", 85: "U", 78: "N", 121: "y", 111: "o", 28: "τ", 89: "Y", 101: "e", 176: "γ", 191: "τ", 162: "∆", 15: "ϵ", 5: "Π", 109: "m", 178: "ϵ", 177: "δ", 103: "g", 98: "b", 174: "α", 173: "Ω", 125: "℘", 194: "χ", 100: "d", 8: "Φ", 94: "⌣", 26: "ρ", 68: "D", 30: "ϕ", 12: "β", 75: "K", 54: "6", 70: "F", 175: "β", 181: "θ", 104: "h", 34: "ε", 4: "Ξ", 42: "⇀", 62: ">", 23: "ν", 119: "w", 38: "ς", 11: "α", 90: "Z", 195: "ψ", 193: "ϕ", 180: "η", 86: "V", 17: "η", 124: "ȷ", 35: "ϑ", 128: "ψ", 73: "I", 36: "ϖ", 166: "Π", 189: "ρ", 112: "p", 170: "Ψ", 107: "k", 77: "M", 120: "x", 99: "c", 76: "L", 93: "♯", 27: "σ", 64: "∂", 190: "σ", 50: "2", 29: "υ", 53: "5", 188: "π", 24: "ξ", 115: "s", 97: "a", 168: "Υ", 164: "Λ", 9: "Ψ", 39: "φ", 41: "↽", 25: "π", 118: "v", 66: "B", 67: "C", 187: "ξ", 81: "Q", 83: "S", 88: "X", 179: "ζ", 95: "⌢", 3: "Λ", 52: "4", 14: "δ", 122: "z", 31: "χ", 183: "κ", 22: "µ", 113: "q", 80: "P", 60: "<", 102: "f", 47: "◁", 82: "R", 32: "ψ", 6: "Σ", 110: "n", 169: "Φ", 84: "T", 123: "ı", 167: "Σ", 192: "υ", 87: "W", 161: "Γ", 106: "j", 37: "ϱ", 48: "0", 117: "u", 71: "G", 72: "H", 65: "A", 108: "l", 49: "1", 1: "∆", 96: "ℓ", 2: "Θ", 51: "3", 186: "ν", 59: ",", 63: "⋆", 16: "ζ", 105: "i", 92: "♮", 7: "Υ", 56: "8", 55: "7", 21: "λ", 160: " ", 33: "ω", 57: "9", 20: "κ", 58: ".", 69: "E", 116: "t", 18: "θ", 10: "Ω", 40: "↼", 114: "r", 19: "ι", 182: "ι", 0: "Γ", 185: "µ", 126: "\u{20d7}", 79: "O", 163: "Θ", 61: "/"} for <</Type /Font/Subtype /Type1/BaseFont /APPDUE+CMMI10/FirstChar 11/FontDescriptor 1143 0 R/LastChar 122/ToUnicode 1193 0 R/Widths 1129 0 R>>

For http://arxiv.org/pdf/2312.00735v1:

thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:750:27:
missing char 118 in map {159: "√", 62: "\u{f8f4}", 57: "\u{f8fc}", 218: "Ω", 213: "Π", 63: "\u{f8e6}", 64: "\u{f8ed}", 50: "\u{f8ee}", 66: "\u{f8ec}", 212: "Ξ", 55: "\u{f8fa}", 65: "\u{f8f8}", 58: "\u{f8f3}", 49: "\u{f8f6}", 215: "Υ", 53: "\u{f8fb}", 56: "\u{f8f1}", 67: "\u{f8f7}", 208: "Γ", 59: "\u{f8fe}", 216: "Φ", 160: " ", 210: "Θ", 217: "Ψ", 211: "Λ", 51: "\u{f8f9}", 54: "\u{f8ef}", 52: "\u{f8f0}", 60: "\u{f8f2}", 214: "Σ", 48: "\u{f8eb}", 61: "\u{f8fd}", 209: "∆"} for <</Type /Font/Subtype /Type1/BaseFont /KFVYMG+CMEX10/FirstChar 16/FontDescriptor 638 0 R/LastChar 118/ToUnicode 671 0 R/Widths 617 0 R>>
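
Until the panics are fixed, a batch job can at least keep going past a failing document. The following is a minimal sketch using only the standard library, assuming the crate is built with the default `panic = "unwind"` setting. `extract_or_error` is a hypothetical helper name, and the closures in `main` are stand-ins for the real extraction call, which is not shown here.

```rust
use std::panic::{self, UnwindSafe};

/// Runs `extract` and converts a panic into an `Err`, so that one bad
/// document does not abort the whole batch.
/// `extract` is a stand-in (assumption) for the real extraction call.
fn extract_or_error<F>(extract: F) -> Result<String, String>
where
    F: FnOnce() -> String + UnwindSafe,
{
    panic::catch_unwind(extract).map_err(|payload| {
        // Panic payloads are usually a `String` or a `&'static str`.
        if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else if let Some(msg) = payload.downcast_ref::<&str>() {
            (*msg).to_string()
        } else {
            "unknown panic".to_string()
        }
    })
}

fn main() {
    // A closure that succeeds, standing in for a PDF that extracts cleanly.
    let ok = extract_or_error(|| "some text".to_string());
    // A closure that panics the way the log lines above do.
    let failed = extract_or_error(|| panic!("missing char {} in map", 16));
    println!("{:?}", ok);
    println!("{:?}", failed);
}
```

The default panic hook still prints the panic message to stderr. This only isolates the failure; it does not recover any text from the affected PDF.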

jrmuizel commented 7 months ago

Commit aeb9a9dae7d456ecfdd014cc5f3e409e9fb57fd2 fixes the first PDF. I haven't tested the other ones yet.
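
For illustration only, and not a reproduction of what that commit does: one general way to make a lookup tolerate a code that is absent from a ToUnicode-style map is to fall back to the Unicode replacement character rather than panic. The sketch below assumes single-byte character codes and a `HashMap<u32, String>` shaped like the maps printed in the panic messages. `decode_code` and `decode_bytes` are hypothetical names, not functions from pdf-extract.

```rust
use std::collections::HashMap;

/// Unicode replacement character used when a code has no mapping.
const REPLACEMENT: &str = "\u{FFFD}";

/// Looks up `code` in a ToUnicode-style map, falling back to U+FFFD
/// instead of panicking when the entry is absent.
fn decode_code(map: &HashMap<u32, String>, code: u32) -> &str {
    map.get(&code).map(String::as_str).unwrap_or(REPLACEMENT)
}

/// Decodes a sequence of single-byte character codes.
fn decode_bytes(map: &HashMap<u32, String>, bytes: &[u8]) -> String {
    bytes.iter().map(|&b| decode_code(map, u32::from(b))).collect()
}

fn main() {
    // Two entries taken from the CMEX10 map in the report; code 16 is absent.
    let mut map = HashMap::new();
    map.insert(208, "Γ".to_string());
    map.insert(218, "Ω".to_string());
    let text = decode_bytes(&map, &[208, 16, 218]);
    println!("{}", text);
}
```

A fallback like this trades a hard failure for a visibly marked gap in the output, which may or may not be the right choice for a given caller.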