Question: How do you use this?

jakehockey10 commented 7 years ago

Hello,

I'm in a situation that almost mirrors the use case described in your README. I would love to see if your library works for me, but I'm not sure how to use it. The example in the README shows how to specify options, but I don't see where any work is actually being done. Where do I do my work in this example?

Sorry for such a noob question, I'm still learning about managing memory and CPU usage with long-running processes and lots of them.

Thanks in advance

vangelov commented 7 years ago

Hello,

Thanks for your question; it's not a noob question at all. It's actually my fault that I didn't explain the module well enough.

I'll be glad to help, just one question before that: are you trying to use it as a standalone program from the terminal or as a library by importing it in your own module?

jakehockey10 commented 7 years ago

I'm trying to use it in conjunction with my node app, which consists of a main process (call it main.js) that kicks off a child process (call it worker.js). This child process is responsible for opening a pdf with the pdf.js npm package, counting its pages, and, for each page in each pdf, using the tabula-js npm package to parse the tabular information inside. tabula-js kicks off a .jar file asynchronously with .spawn, and I'm having a really hard time controlling how much CPU is used during this process. tabula-js gives no way to call that .jar file synchronously, so I'm trying to find a way to keep my operating system from bogging down whenever I run my node application.

Does that answer the question? If not, I can try again!

vangelov commented 7 years ago

Thanks for the detailed explanation!

Here's the code for a small demo I tried and it seems to be working:

const tabula = require('tabula-js');
const limiter = require('cpulimit');

// Run tabula
const t = tabula('file.pdf'); // your pdf filename
t.extractCsv((err, data) => {
  if (err) {
    console.error('Error:', err.message);
    return;
  }
  console.log(data);
});

// Run cpulimit
const options = {
  limit: 20, // or any other value
  includeChildren: true,
  command: 'tabula-java.jar'
};

limiter.createProcessFamily(options, (err, processFamily) => {
  if (err) {
    console.error('Error:', err.message);
    return;
  }

  limiter.limit(processFamily, options, (err) => {
    if (err) {
      console.error('Error:', err.message);
    } else {
      console.log('Done.');
    }
  });
});

So essentially after you run tabula-js, run cpulimit.

Try it and tell me if this worked for you.

jakehockey10 commented 7 years ago

Thanks for the explanation! I'll give this a shot and let you know how it goes. My thinking is this: since I'm running many tabula calls concurrently in a worker child process, I can run cpulimit in the main process right before forking off all these child processes. The difference from what you have here is that I would be calling into tabula many times, and doing so after running cpulimit, as opposed to running tabula once and running cpulimit immediately afterwards. Do you see that working?

Side note: the reason I'm in this situation is that tabula-js seems to use the non-blocking .spawn method to kick off the child process that runs the .jar file. I forked the tabula-js repository and tried to add an option that lets the developer choose .spawnSync instead, but couldn't get it working. You wouldn't know of any way I could get a nodejs program to wait for an external process, like the running of a .jar file, would you? This is ultimately why I'm trying to use cpulimit: I am running tabula a lot of times, and I can't find a way to slow it down other than throttling it myself (which I don't know how to do yet) or using something like cpulimit.

Thanks again for your help!

vangelov commented 7 years ago

But your callback is called when the child process finishes. Why do you need to make it synchronous?
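
To illustrate the point, a non-blocking spawn can still be waited on by wrapping it in a Promise that settles when the child exits. This is only a minimal sketch using Node's built-in child_process module: runAndWait is a hypothetical helper (not part of tabula-js or cpulimit), and running Node itself here merely stands in for launching a .jar file.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: spawn a command and resolve with its stdout
// once the process has exited successfully.
function runAndWait(command, args) {
  return new Promise((resolve, reject) => {
    // Ignore stdin, capture stdout, pass stderr through to the parent.
    const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'inherit'] });
    let stdout = '';
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.on('error', reject); // e.g. the command could not be started
    child.on('close', (code) => {
      if (code === 0) resolve(stdout);
      else reject(new Error(`${command} exited with code ${code}`));
    });
  });
}

// Example: Node itself plays the role of the external program.
runAndWait(process.execPath, ['-e', 'console.log("done")'])
  .then((out) => console.log(out.trim()))
  .catch((err) => console.error('Error:', err.message));
```

Nothing blocks the event loop here; the caller simply continues in .then (or after await) once the external process has finished.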

jakehockey10 commented 7 years ago

I guess what I could do is keep an array of "jobs" local to my node process and pop off a new job in each callback to control the flow. My problem has been that I've been looping through pdfs, looping through each pdf's pages, and kicking off a tabula process for each of the 12 tables on each page. I'm having trouble reorganizing the code I've written to do this in a more conservative manner, so that I'm not running hundreds of Tabula runs at once.
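
A rough sketch of that "array of jobs" idea, where the next job starts only from the previous job's callback, so a single run is active at any time. Everything here is hypothetical: the job objects, runJob (a timer standing in for a real tabula call), and runAll are illustrative names, not part of any library.

```javascript
// Hypothetical job list: one entry per table to extract.
const jobs = [
  { pdf: 'a.pdf', page: 1, table: 1 },
  { pdf: 'a.pdf', page: 1, table: 2 },
  { pdf: 'b.pdf', page: 1, table: 1 },
];

// Stand-in for a tabula run; assumed to call done(err, result) when finished.
function runJob(job, done) {
  setTimeout(() => done(null, `${job.pdf} p${job.page} t${job.table}`), 10);
}

// Take one job off the front of the queue, run it, and start the next
// one from its callback. Stops at the first error.
function runAll(queue, runOne, finished) {
  const results = [];
  function next() {
    const job = queue.shift();
    if (job === undefined) return finished(null, results);
    runOne(job, (err, result) => {
      if (err) return finished(err);
      results.push(result);
      next();
    });
  }
  next();
}

// Pass a copy so the original jobs array is left untouched.
runAll(jobs.slice(), runJob, (err, results) => {
  if (err) return console.error('Error:', err.message);
  console.log(`Finished ${results.length} jobs`);
});
```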

What would be your approach, if you don't mind me asking?

vangelov commented 7 years ago

I suggest you use a task queue and add each page of each pdf file as a task. Then process those tasks with the right concurrency level (how many tasks are being processed in parallel at any moment). Try using https://caolan.github.io/async/docs.html#queue, it gives you all of the above features.
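
To make the idea concrete, here is a dependency-free sketch of what such a queue does; it is a stand-in written for illustration, not the async library's API or implementation. processQueue, pageCounts, and parsePage are hypothetical names, and the timer replaces the real tabula call.

```javascript
// Run worker(task) over all tasks with at most `concurrency` in flight.
async function processQueue(tasks, concurrency, worker) {
  const results = new Array(tasks.length);
  let nextIndex = 0;

  // Each runner keeps pulling the next unclaimed task until none are left.
  async function runner() {
    while (nextIndex < tasks.length) {
      const i = nextIndex++;
      results[i] = await worker(tasks[i]);
    }
  }

  const runners = [];
  for (let n = 0; n < Math.min(concurrency, tasks.length); n++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Hypothetical input: number of pages in each PDF.
const pageCounts = { 'a.pdf': 3, 'b.pdf': 2 };

// One task per page of each PDF.
const tasks = [];
for (const [pdf, pages] of Object.entries(pageCounts)) {
  for (let page = 1; page <= pages; page++) tasks.push({ pdf, page });
}

// Stand-in for parsing one page with tabula.
const parsePage = (task) =>
  new Promise((resolve) => setTimeout(() => resolve(`${task.pdf} p${task.page}`), 10));

processQueue(tasks, 2, parsePage)
  .then((results) => console.log(`${results.length} pages parsed`))
  .catch((err) => console.error('Error:', err.message));
```

The concurrency argument is the knob that matters: it caps how many tabula processes exist at once, regardless of how many pages are waiting.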

vangelov / node-cpulimit
