naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Custom traindata do not work #894

Closed frank-pian closed 4 months ago

frank-pian commented 4 months ago

V5.0.3

Describe the bug I followed the official demo with a slight modification, but the network tab shows that the new traindata is never loaded. I have tested several path formats ("file:///", "./", "https://localhost/") and none of them work. I just want to replace the current simple traindata with the best traindata.

To Reproduce Steps to reproduce the behavior:

  1. My path structure (screenshot-20240223-191418)
  2. index.html

    <!DOCTYPE HTML>
    <html>
    <head>
    <script src="./tesseract.min.js"></script>
    </head>
    <body>
    <input type="file" id="uploader" multiple>
    <script type="module">
    
      // This example builds on "basic-efficient.html".
      // Rather than using a single worker, a scheduler manages a pool of multiple workers. 
      // While performance is similar for a single file, this parallel processing results in significantly
      // faster speeds when used with multiple files.
    
      const scheduler = Tesseract.createScheduler();
    
      // Creates worker and adds to scheduler
      const workerGen = async () => {
        const worker = await Tesseract.createWorker("eng", 1, {
          workerPath: './worker.min.js',
          corePath: './tesseract-core-simd-lstm.wasm.js',
          langPath: './',
          logger: function(m){console.log(m);}
        });
    
        scheduler.addWorker(worker);
      }
    
      const workerN = 4;
      (async () => {
        const resArr = Array(workerN);
        for (let i=0; i<workerN; i++) {
          resArr[i] = await workerGen();
        }
      })();
    
      const recognize = async function(evt){
        const files = evt.target.files;
    
        for (let i=0; i<files.length; i++) {
          scheduler.addJob('recognize', files[i]).then(
            (x) => {
              // Each entry in x.data.lines is one recognized line of text.
              x.data.lines.forEach(line => {
                console.log(`${line.text} ${line.bbox.x0} ${line.bbox.x1} ${line.bbox.y0} ${line.bbox.y1}`);
              });
            }
          )
        }
      }
    
      const elm = document.getElementById('uploader');
      elm.addEventListener('change', recognize);
    </script>
    </body>
    </html>

https://github.com/naptha/tesseract.js/issues/101

Balearica commented 4 months ago

Language data files are cached by default. If Tesseract.js finds that a valid language data file already exists in your local storage, it will not download a new one from a remote server. You can disable this behavior for development/testing purposes by setting cacheMethod: 'none'; however, you should remove this setting before deploying your site publicly to avoid unnecessary network usage.
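
A minimal sketch of how that setting might be applied to the createWorker call from the report, so that it is only active during development. The `workerOptions` helper and its `isDev` flag are hypothetical names introduced for illustration; the paths are the ones from the original report.

```javascript
// Hypothetical helper: builds the options object passed to Tesseract.createWorker.
// `isDev` is an assumed flag; set it however your project tells dev from production.
function workerOptions(isDev) {
  const options = {
    workerPath: './worker.min.js',
    langPath: './', // directory that holds the custom language data
    logger: (m) => console.log(m),
  };
  if (isDev) {
    // Bypass the cache so the file under langPath is fetched on every load.
    options.cacheMethod = 'none';
  }
  return options;
}

// Only runs where tesseract.min.js has defined the Tesseract global (the browser page).
if (typeof Tesseract !== 'undefined') {
  Tesseract.createWorker('eng', 1, workerOptions(true)).then((worker) => {
    console.log('worker ready');
    return worker.terminate();
  });
}
```

Passing `false` leaves cacheMethod unset, so the default caching behavior applies in production.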

On an unrelated note, if you want your project to run on all devices, you should point corePath to a directory containing all of the different .wasm.js files from Tesseract.js-core. tesseract-core-simd-lstm.wasm.js is the fastest build, so it is used by default on supported devices, but some devices (older iOS devices, for example) do not support it. When corePath is set to a directory, the correct version is picked automatically.
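
As a rough sketch of the directory form, assuming the .wasm.js files from Tesseract.js-core have all been copied into a folder next to the page. The folder name `./tesseract-core` and the `coreDirOptions` name are hypothetical.

```javascript
// Assumed layout: ./tesseract-core is a hypothetical directory holding every
// .wasm.js build from Tesseract.js-core, not only the SIMD/LSTM one.
const coreDirOptions = {
  workerPath: './worker.min.js',
  corePath: './tesseract-core', // a directory rather than a single .wasm.js file
  langPath: './',
};

// Only runs where tesseract.min.js has defined the Tesseract global (the browser page).
if (typeof Tesseract !== 'undefined') {
  Tesseract.createWorker('eng', 1, coreDirOptions).then((worker) => {
    console.log('worker ready');
    return worker.terminate();
  });
}
```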

frank-pian commented 4 months ago

Setting cacheMethod: 'none' worked, thanks!