naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Missing "languages" attributes on default export #887

Open MasGaNo opened 7 months ago

MasGaNo commented 7 months ago

Tesseract.js version 5.0.4

Describe the bug The languages constant object is missing from the type definitions, even though it is exported from index.js.

To Reproduce Steps to reproduce the behavior:

  1. Install tesseract.js
  2. Try to import it: import { languages } from 'tesseract.js';
  3. See error

Please attach any input image required to replicate this behavior. [image]

Expected behavior The languages object should be accessible from a TypeScript codebase, which would avoid this kind of issue. It would also improve type safety and make it possible to build validation rules with Zod/Yup/Joi/... by passing this object directly as the source of truth. [image]

Device Version:

Additional context My current workaround to fix this issue is to create a tesseract.d.ts file in my project and add this block:

export * from 'tesseract.js';

declare module "tesseract.js" {
  export const languages: Record<'AFR' | 'AMH' | 'ARA' | 'ASM' | 'AZE' | 'AZE_CYRL' | 'BEL' | 'BEN' | 'BOD' | 'BOS' | 'BUL' | 'CAT' | 'CEB' | 'CES' | 'CHI_SIM' | 'CHI_TRA' | 'CHR' | 'CYM' | 'DAN' | 'DEU' | 'DZO' | 'ELL' | 'ENG' | 'ENM' | 'EPO' | 'EST' | 'EUS' | 'FAS' | 'FIN' | 'FRA' | 'FRK' | 'FRM' | 'GLE' | 'GLG' | 'GRC' | 'GUJ' | 'HAT' | 'HEB' | 'HIN' | 'HRV' | 'HUN' | 'IKU' | 'IND' | 'ISL' | 'ITA' | 'ITA_OLD' | 'JAV' | 'JPN' | 'KAN' | 'KAT' | 'KAT_OLD' | 'KAZ' | 'KHM' | 'KIR' | 'KOR' | 'KUR' | 'LAO' | 'LAT' | 'LAV' | 'LIT' | 'MAL' | 'MAR' | 'MKD' | 'MLT' | 'MSA' | 'MYA' | 'NEP' | 'NLD' | 'NOR' | 'ORI' | 'PAN' | 'POL' | 'POR' | 'PUS' | 'RON' | 'RUS' | 'SAN' | 'SIN' | 'SLK' | 'SLV' | 'SPA' | 'SPA_OLD' | 'SQI' | 'SRP' | 'SRP_LATN' | 'SWA' | 'SWE' | 'SYR' | 'TAM' | 'TEL' | 'TGK' | 'TGL' | 'THA' | 'TIR' | 'TUR' | 'UIG' | 'UKR' | 'URD' | 'UZB' | 'UZB_CYRL' | 'VIE' | 'YID', string>;
}

But it would be better to generate the definition directly from the project, carrying over the JSDoc on the languages constants.
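As a rough illustration of what a typed languages export would enable, here is a minimal sketch. It uses a three-entry stand-in object rather than the real export, and it assumes the values are the lowercase language codes; LanguageCode and isLanguageCode are hypothetical names, not part of tesseract.js.

```typescript
// Hypothetical, abbreviated stand-in for the `languages` export.
// Assumption: keys are uppercase names and values are lowercase codes.
const languages = {
  ENG: 'eng',
  FRA: 'fra',
  CHI_SIM: 'chi_sim',
} as const;

// Union of the object's values: 'eng' | 'fra' | 'chi_sim'.
type LanguageCode = (typeof languages)[keyof typeof languages];

// Runtime validator that doubles as a type guard, using the object
// itself as the single source of truth.
function isLanguageCode(value: string): value is LanguageCode {
  return (Object.keys(languages) as Array<keyof typeof languages>).some(
    (key) => languages[key] === value,
  );
}

// Narrowing an untrusted string, e.g. one read from user input.
const input: string = 'chi_sim';
if (isLanguageCode(input)) {
  const code: LanguageCode = input; // type-checks only inside the guard
  console.log(code);
}
```

The same object could be handed to a schema library instead of a hand-written guard; the guard is used here only to keep the sketch dependency-free.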

Thank you.

Balearica commented 7 months ago

I agree this is a good suggestion, and would reduce errors like the one you linked. However, I believe this is a breaking change so the soonest this could be implemented is Tesseract.js v6.0.

Making this change would break code for (1) TypeScript users specifying a custom language and (2) TypeScript users specifying multiple languages by concatenating them with + (e.g. eng+chi_sim). I do not believe this prevents us from ever making this change, as users with multiple languages can switch to specifying them with arrays (e.g. ['eng', 'chi_sim']) and users with custom languages (if any exist) can add a ts-ignore comment. However, this does mean such a change would need to wait for the next major release.
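To make the two breaking cases concrete, here is a minimal sketch of how a stricter language type would behave. KnownLang and toLangArray are hypothetical names used only for illustration; they are not the library's API, and the union is abbreviated to three codes.

```typescript
// Hypothetical, abbreviated union of supported language codes.
type KnownLang = 'eng' | 'chi_sim' | 'fra';

// Hypothetical helper that accepts one code or an array of codes
// and always returns an array.
function toLangArray(langs: KnownLang | KnownLang[]): string[] {
  return Array.isArray(langs) ? langs : [langs];
}

// Still fine under the stricter type:
const single = toLangArray('eng');
const multiple = toLangArray(['eng', 'chi_sim']);

// Case (2): the concatenated form is a plain string outside the union,
// so this line would no longer compile if uncommented:
// toLangArray('eng+chi_sim');

// Case (1): a custom language needs an explicit opt-out.
// @ts-ignore: custom language code outside the known union
const custom = toLangArray('my_custom_lang');

console.log(single, multiple, custom);
```

Both rejections happen at compile time only; plain JavaScript callers would be unaffected by the type change.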

I will update the documentation to remove anything referencing the concatenation method for specifying multiple languages.