jehna / humanify

Deobfuscate Javascript code using ChatGPT
MIT License
1.62k stars · 66 forks

Parallel renames #167

Open jehna opened 5 days ago

jehna commented 5 days ago

My thought is that doing the renames in parallel should speed up the process a lot. Especially if the user has enough OpenAI quota, parallelising the work could make processing large files much faster.

Local inference should also be runnable in parallel, if the user has a good enough GPU at hand.

One big problem is that I've gotten the best results when applying renames from the outermost scope inwards – so say we have:

function a() {
  const b = (c) => {
    return c * c
  }
}

It seems that running the renames in the order a -> b -> c yields much better results than running c -> b -> a.

But if we had multiple same-level identifiers like:

function a() {
  function b() {}
  function c() {}
  function d() {}
}

At least in theory it would be possible to rename a first and then [b, c, d] in parallel, and still get feasible results.

In the best case scenario there would be a second LLM step to check that all variables still make sense after the parallel run has finished.
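In TypeScript terms, that level-order schedule could be sketched roughly like this – `Scope` and `renameOne` are hypothetical stand-ins for humanify's internals, not its actual API:

```typescript
// Hypothetical scope tree; humanify's real traversal works on the parsed AST.
type Scope = { name: string; children: Scope[] };

// Placeholder for a single LLM rename call.
async function renameOne(name: string): Promise<string> {
  return `renamed_${name}`;
}

// Rename the parent first so children are renamed against its final name,
// then process all same-level siblings concurrently.
async function renameTree(scope: Scope, out: Map<string, string>): Promise<void> {
  out.set(scope.name, await renameOne(scope.name));
  await Promise.all(scope.children.map((child) => renameTree(child, out)));
}
```

For the example above this would make one sequential call for a, then three concurrent calls for [b, c, d].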

jehna commented 5 days ago

Need to implement proper request throttling and retry logic when doing this
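A minimal retry-with-backoff sketch – the attempt count, base delay, and the `status === 429` check are illustrative assumptions about the error shape, not humanify's actual implementation:

```typescript
// Retry an API call on 429s with exponential backoff.
// The attempt count and base delay here are illustrative defaults.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseMs = 1_000
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err: any) {
      // Only retry rate-limit errors, and only while attempts remain.
      if (err?.status !== 429 || i >= attempts - 1) throw err;
      // Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ...
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** i));
    }
  }
}
```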

0xdevalias commented 3 days ago

> Need to implement proper request throttling and retry logic when doing this

Related:

This seems to be the section of code for implementing better throttling/retry logic (at least for the openai plugin):

brianjenkins94 commented 2 days ago

Resume-ability would also be a good thing to consider.

0xdevalias commented 2 days ago

> Resume-ability would also be a good thing to consider.

Some of the discussion in the following issue could tangentially relate to resumability: if a consistent 'map' of renames were created, perhaps it could also show which sections of the code hadn't yet been processed:
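One possible shape for that – the checkpoint file name and JSON layout here are assumptions for illustration, not an existing humanify feature:

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Hypothetical checkpoint file; not an existing humanify feature.
const MAP_FILE = "humanify-renames.json";

function loadRenameMap(): Record<string, string> {
  return existsSync(MAP_FILE)
    ? JSON.parse(readFileSync(MAP_FILE, "utf8"))
    : {};
}

// Skip identifiers that a previous (interrupted) run already renamed,
// and checkpoint the map after every successful rename.
async function renameWithCheckpoint(
  name: string,
  doRename: (n: string) => Promise<string>
): Promise<string> {
  const map = loadRenameMap();
  if (name in map) return map[name];
  const renamed = await doRename(name);
  map[name] = renamed;
  writeFileSync(MAP_FILE, JSON.stringify(map, null, 2));
  return renamed;
}
```

A map like this would double as the "which sections are done" view: any identifier missing from the map still needs processing.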

brianjenkins94 commented 1 day ago

I'm trying to process a pretty huge file and just ran into this:

RateLimitError: 429 Rate limit reached for gpt-4o-mini in organization org-abcdefghijklmnopqrstuvwx on requests per day (RPD): Limit 10000, Used 10000

I'm going to see about improving the rate limiting here:

// /src/plugins/openai/openai-rename.ts
+import Bottleneck from "bottleneck/light";

+// Math.floor(10_000 / 24) requests/hour
+const limiter = new Bottleneck({
+   "reservoir": Math.floor(10_000 / 24),
+   "reservoirRefreshAmount": Math.floor(10_000 / 24),
+   "reservoirRefreshInterval": 3_600_000
+});

export function openaiRename({
  apiKey,
  baseURL,
  model,
  contextWindowSize
}: {
  apiKey: string;
  baseURL: string;
  model: string;
  contextWindowSize: number;
}) {
  const client = new OpenAI({ apiKey, baseURL });

+  const wrapped = limiter.wrap(async (code: string): Promise<string> => {
    return await visitAllIdentifiers(
      code,
      async (name, surroundingCode) => {
        verbose.log(`Renaming ${name}`);
        verbose.log("Context: ", surroundingCode);

        const response = await client.chat.completions.create(
          toRenamePrompt(name, surroundingCode, model)
        );
        const result = response.choices[0].message?.content;
        if (!result) {
          throw new Error("Failed to rename", { cause: response });
        }
        const renamed = JSON.parse(result).newName;

        verbose.log(`Renamed to ${renamed}`);

        return renamed;
      },
      contextWindowSize,
      showPercentage
    );
+  });

+  // Return the wrapped function itself – calling wrapped() here would
+  // invoke it without the code argument.
+  return wrapped;
}
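One thing to double-check with this patch: `limiter.wrap` gates the whole-file function, so the reservoir counts one unit per file, while `visitAllIdentifiers` fires one request per identifier. Counting per request instead could look like the sketch below – a hand-rolled reservoir purely for illustration (Bottleneck's reservoir options do the same if the wrap is moved onto the per-identifier API call):

```typescript
// Minimal reservoir gate: at most `capacity` releases per `refillMs` window
// once the budget is exhausted. All names here are illustrative.
function makeRateGate(capacity: number, refillMs: number) {
  let remaining = capacity;
  const waiting: Array<() => void> = [];

  function refill() {
    remaining = capacity; // refresh the budget for this window
    while (remaining > 0 && waiting.length > 0) {
      remaining -= 1;
      waiting.shift()!(); // release a queued caller
    }
    // Keep refilling only while callers are still queued.
    if (waiting.length > 0) setTimeout(refill, refillMs);
  }

  // Await this before every API request.
  return function acquire(): Promise<void> {
    if (remaining > 0) {
      remaining -= 1;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => {
      waiting.push(resolve);
      if (waiting.length === 1) setTimeout(refill, refillMs);
    });
  };
}
```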

0xdevalias commented 21 hours ago

Context from other thread:

> Aside: Can I run humanify against multiple files simultaneously? Or would that run the risk of making requests too fast?
>
> In trying to produce benchmark results for #172, I have determined that simultaneous runs cause 429s and cause the application to crash.
>
> Just closing the loop.

Originally posted by @brianjenkins94 in https://github.com/jehna/humanify/issues/67#issuecomment-2427685842