modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

Parser NO SPACE - "pdf2json": "^3.1.3", #355

Open CristianKalexAI opened 1 month ago

CristianKalexAI commented 1 month ago

20240626 Mercadona 146,53 €.pdf

const buffer = Buffer.from(await file.arrayBuffer())
const pdfParser = new PDFParser(null, true);

pdfParser.on("pdfParser_dataReady", (pdfData) => {
  const text = pdfParser.getRawTextContent()
  console.log(text)
})

pdfParser.on("pdfParser_dataError", (errData) => {
  console.log('error')
  return NextResponse.json({ error: errData.parserError }, { status: 400 })
})

pdfParser.parseBuffer(buffer)

// RESULT

MERCADONA, S.A.
A-46103834
C/ ALICANTE 83
03801 ALCOI/ALCOY
TELÉFONO:965366422
26/06/2024 11:58 OP: 1475177
FACTURA SIMPLIFICADA: 4530-016-095252
DescripciónP. UnitImporte
1FUET ESPETEC EXTRA2,57 // NO SPACE: should be "1 FUET ESPETEC EXTRA 2,57"
1PAÑUELO LOCION1,50 // NO SPACE: should be "1 PAÑUELO LOCION 1,50"
1PAN S/CORTEZA BLANCO1,60 // NO SPACE, etc.
1SNACK PIPAS1,35
1SALADITOS1,25
1BARRA DE PAN 3 UDS1,15
1ESPIRALES MOSQUITOS2,00
1BURGER VACUNO CERDO3,78
1INS PERFUMADO2,25
1P. PAV RED. SAL BIPA3,30
112 HUEVOS GRANDES-L2,20
1CHULETA AGUJA4,21
1PALITOS SURIMI1,99
1BURGER M VA/CE 1000G6,50
2PIZZA JAMÓN SERRANO2,905,80
1FILETE MERLUZA CABO5,10
1PORCIONES BACALAO7,20
1LONCHAS DE QUESO2,20
1GALL DIGESTIVE1,70
1GRIEGO AZUCAR CAÑA1,65
1YOG LIQ AZUCAR CAÑA2,05
1PETIT DE BOLSILLO FR1,60
2PANECILLO 11 UDS1,102,20
1MELOCOTÓN BANDEJA3,16
1AGUACATE BANDEJA2,83
1MUESLI C/CHOCOLATE2,25
1ZANAHORIA BOLSA1,09
1LAVAVAJILLAS ULTRA1,10
2MARCILLA ESPRESSO5,3010,60
1STICK SOLAR FPS505,50
1FUSILLI1,24
2TOMATE ENTERO PELADO1,352,70
1MAIZ DULCE PACK-31,60
1PATATA 3 KG4,90
1REPELENTE FUERTE2,75
1DENTAL PEARL1,95
1TOMATE FRITO CASERO1,75
2GARBANZO M.COCIDO0,711,42
1CEBOLLA 2 KG2,79
1SALMOREJO FRESCO3,15
1PROT.NORMAL PLEGADO1,10
1LECHE ENTERA P68,82
2BRONCHALES 6X1,5L2,344,68
1CERVEZA 6 X 1 L7,50
1BROCOLI 0,640 kg2,49 €/kg1,59
1PARAGUAYO 0,378 kg2,49 €/kg0,94
1TOMATE ENSALADA 0,892 kg1,99 €/kg1,78
1MANZ. ROJA ACIDULCE 0,992 kg2,75 €/kg2,73
1BERENJENA RAYADA GR 0,326 kg1,99 €/kg0,65
1CALABACIN BLANCO 0,582 kg1,99 €/kg1,16
TOTAL (€)146,53
TARJETA BANCARIA146,53
IVABASE IMPONIBLE (€)CUOTA (€)
0%41,790,00
4%1,060,04
5%1,180,06
10%70,777,08
21%20,294,26
TOTAL135,0911,44
TARJ. BANCARIA: **** 9102
N.C: 034903088 AUT: 261213
AID: A0000000031010 ARC: 3030
Verificado por dispositivo
Importe: 146,53 € Visa DEBIT
SE ADMITEN DEVOLUCIONES CON TICKET
----------------Page (0) Break----------------

The PDF itself has spaces between these fields. Do I need to pass some parameters? Thanks in advance for your help.

JorrieB commented 1 month ago

I'm having a similar problem: I see no spaces at all when parsing 5 different PDFs. I also found a Reddit thread describing the same problem (source).

An example PDF is being parsed as follows:

TermsofEmploymentDocument 1.EmployeeClassification 1.1Full-TimeEmployees Full-timeemployeesarethosewhoareregularlyscheduledtoworkthe company’sstandardworkweek.Full-timeemployeesareeligibleforallcompany benefits,subjecttothetermsandconditionsofeachbenefitprogram. 1.2Part-TimeEmployees Part-timeemployeesarethosewhoareregularlyscheduledtoworkfewerhours thanthestandardworkweekofthecompany.Part-timeemployeesareeligiblefor somecompanybenefits,whichwillbespecifiedintheemployeehandbook,and aresubjecttothetermsandconditionsofeachbenefitprogram. 1.3Temporary/ContractEmployees Temporaryorcontractemployeesarehiredtoworkonaspecificprojectorfora predeterminedperiod.Theseemployeesarenoteligibleforbenefitsexceptas requiredbylawandarenotconsideredpermanentorpart-timeemployeesofthe company. 2.AttendancePolicy 2.1GeneralAttendanceRequirements Allemployeesareexpectedtoadheretotheirscheduledworkhours.Regular attendanceandpunctualityareimportanttomaintainteameffectivenessandthe smoothoperationofthecompany. ----------------Page (0) Break---------------- 2.2ReportingAbsences Intheeventofanabsence,employeesmustnotifytheirdirectsupervisorasearly aspossible,butnolaterthanonehourbeforetheirscheduledstarttime. Notificationshouldincludethereasonfortheabsenceandtheexpectedreturn date. 2.3UnscheduledAbsences Frequentunscheduledabsences(absenceswithoutpriorapproval)are unacceptableandmayleadtodisciplinaryaction,uptoandincludingtermination ofemployment. 2.4Long-termAbsences Forabsencesduetomedicalreasonsthatareexpectedtolastbeyondfive consecutiveworkdays,employeesarerequiredtoprovideamedicalcertificate anddiscusstheirsituationwithHumanResourcestodeterminetheappropriate courseofaction,whichmayincludeshort-termdisability,long-termdisability,or otherleavesofabsence. 
2.5LeavePolicies Thecompanyprovidesvariousleavesofabsencetoaccommodatetheneedsof employees,includingbutnotlimitedtomedicalleave,familyleave,and bereavementleave.Employeesshouldrefertotheemployeehandbookor contactHumanResourcesforinformationaboutleaveentitlementsand proceduresforapplyingforleave. ----------------Page (1) Break----------------
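
Both reports lose spaces in the output of getRawTextContent(). One possible workaround, offered as a sketch and not a confirmed fix, is to rebuild the lines from the JSON payload passed to the "pdfParser_dataReady" handler, using the position of each text run. It assumes the payload has the Pages / Texts / R / T layout described in the pdf2json README, that T is URI-encoded, and that each field or word in the affected PDFs is stored as its own text run. The helper names rebuildLines and decodeRun, and the yTolerance default, are made up for illustration and are not part of pdf2json.

```javascript
// Sketch: rebuild spaced lines from pdf2json's JSON output instead of
// relying on getRawTextContent().
// Assumed input shape: { Pages: [{ Texts: [{ x, y, R: [{ T }] }] }] }.

// Decode one text run; fall back to the raw string if it is not valid
// URI-encoded text.
function decodeRun(t) {
  try {
    return decodeURIComponent(t);
  } catch {
    return t;
  }
}

// Hypothetical helper (not part of pdf2json). For each page it groups runs
// whose y is within yTolerance of the first run on the current line, sorts
// each line left to right, and joins the runs with a single space.
// Returns one array of line strings per page.
function rebuildLines(pdfData, yTolerance = 0.1) {
  const pages = [];
  for (const page of pdfData.Pages ?? []) {
    const runs = (page.Texts ?? [])
      .map((t) => ({
        x: t.x,
        y: t.y,
        text: (t.R ?? []).map((r) => decodeRun(r.T)).join(""),
      }))
      .filter((r) => r.text.trim() !== "")
      .sort((a, b) => a.y - b.y || a.x - b.x);

    const lines = [];
    for (const run of runs) {
      const last = lines[lines.length - 1];
      if (last && Math.abs(run.y - last.y) <= yTolerance) {
        last.runs.push(run);
      } else {
        lines.push({ y: run.y, runs: [run] });
      }
    }

    pages.push(
      lines.map((line) =>
        line.runs
          .sort((a, b) => a.x - b.x)
          .map((r) => r.text.trim())
          .join(" ")
      )
    );
  }
  return pages;
}
```

Inside the existing handler this would be called as rebuildLines(pdfData), and the first page printed with rebuildLines(pdfData)[0].join("\n"). If a PDF stores every glyph as a separate run, this sketch would put a space between letters; in that case the join would need a gap threshold based on the distance between neighbouring runs.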