neoOpus opened this issue 1 month ago
My suggestion is to create a mapping file that lists generated variable names alongside their LLM-generated alternatives
@neoOpus This is similar to an area I have spent a fair bit of time thinking about/prototyping tooling around in the past. One of the bigger issues that you're likely to find here is that with bundlers like webpack/etc, when they minimise the variable names, they won't necessarily choose the same minified variable name for the same code each time. So to make a 'lookup table' type concept work, you first need to be able to stabilise the 'reference key' for each of those variables, even if the bundler chose something different to represent it.
You can find some of my initial hacky prototypes scattered in this repo:
My thoughts/notes on this are scattered around a few places, but these may be some useful/interesting places to start:
You can see an example of a larger scale project where I was trying to stabilise the minified variable names to reduce the 'noise' in large scale source diffing here:
(Edit: I have captured my notes from this comment on the following gist for posterity: https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#issue-97-more-deterministic-renames-across-different-versions-of-the-same-code)
Thank you, Glenn, for taking the time to answer me in detail...
I'd written a message before this one, but it got lost...
I am going through the links you just shared, and I will get back to you with some ideas. I think I already have some that are worth discussing, but I want to make sure first that they are valid and viable, as my knowledge is still very limited in this area.
Currently, LLMs often guess variable names differently across various versions of the same JavaScript code. This inconsistency complicates versioning, tracking changes, and merging code for anyone regularly analyzing or modifying applications, extensions, etc.
Just to clarify that I'm on the same page here, is the issue that:
This is an interesting problem. I'd love to research some ways to implement this. Especially AST fingerprinting seems promising, thank you @0xdevalias for your links.
One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source.
In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs.
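Something like a local cache keyed on a code fingerprint, for example. Very rough sketch below, where fingerprint(), humanifyChunk() and applyRenames() are just stand-ins for whatever stable hashing, LLM renaming and AST rename steps would actually be used:

```js
const fs = require('fs');

const CACHE_FILE = '.humanify-cache.json';

async function renameChunk(code) {
  const cache = fs.existsSync(CACHE_FILE)
    ? JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'))
    : {};

  // Stable hash of the normalised chunk (see the fingerprinting discussion further down)
  const key = fingerprint(code);

  if (!cache[key]) {
    // Cache miss: fall back to the LLM, then persist the generated rename map
    cache[key] = await humanifyChunk(code);
    fs.writeFileSync(CACHE_FILE, JSON.stringify(cache, null, 2));
  }

  // Cache hit (or freshly cached): apply the saved names mechanically, no LLM needed
  return applyRenames(code, cache[key]);
}
```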
I would like to share an idea I’ve been considering, even though I’m still in the process of researching this topic. I hope it proves to be useful!
My suggestion is to break the code down into smaller, modular functions, which seems to be a practice your script might already be implementing. One approach to enhance this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) in order of appearance. (I honestly don't know how this could be done, but maybe via RegEx or just by asking an LLM to do it.)
Anyway, this would allow for a standardized, minified version of the code. After creating this stripped-down and abstracted version, we could calculate a hash of the code as a string. This hash would serve as a unique identifier to track portions of the code across different versions of the project and to prevent duplicate entries, as well as a reference to where the future generated variable names are stored. The resulting data could be stored in an appropriate format, such as CSV, NoSQL, or JSON, based on your requirements for speed, scalability, and ease of access.
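To illustrate what I mean, here is a very naive sketch. It is regex-based, so it would also mangle keywords, strings and property names; an AST-based rename, like the ones discussed later in this thread, would be the robust way to do it:

```js
const crypto = require('crypto');

// Very naive normalisation: collapse whitespace, then replace every
// identifier-looking token with a positional placeholder (v0000, v0001, ...)
// in order of appearance, and hash the result.
function naiveFingerprint(code) {
  const seen = new Map();
  const normalised = code
    .replace(/\s+/g, ' ')
    .replace(/[A-Za-z_$][\w$]*/g, (name) => {
      if (!seen.has(name)) seen.set(name, `v${String(seen.size).padStart(4, '0')}`);
      return seen.get(name);
    });
  return crypto.createHash('sha256').update(normalised).digest('hex');
}
```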
Next, we could analyze this stored data from a designated project location, or maybe a specified subfolder (e.g. into .humanifyjs). Here, we could leverage language models (LLMs) to generate meaningful variable names based on the context of the functions. This would create a "reference" that can assist in future analyses of the code.
When new versions of the obfuscated code are generated (which will have different variable names), we can apply a similar process to compare them with previously processed versions. By using diff techniques, we can identify changes and maintain a collection of these sub-chunks of code, which would help reduce discrepancies. In most cases, we should see a high degree of similarity unless a particular function’s logic has altered. We can then reassign the previously generated variable names (instead of the original variable names or having to generate different ones) to the new code chunks by feeding them as choices for the LLM or assigning them directly programmatically to reduce the need to consume more tokens for the same chunks.
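Put very roughly, the matching step I have in mind would be something like this (fingerprint() standing in for whatever stable hashing method ends up being used):

```js
// Classify chunks from a new build against the previous build by fingerprint.
function diffBuilds(oldChunks, newChunks) {
  const previous = new Map(oldChunks.map((chunk) => [fingerprint(chunk.code), chunk]));

  return newChunks.map((chunk) => {
    const match = previous.get(fingerprint(chunk.code));
    return match
      ? { status: 'unchanged', chunk, reuseNamesFrom: match } // reuse stored names directly
      : { status: 'new-or-changed', chunk };                   // needs the LLM (or fuzzier matching)
  });
}
```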
Additionally, to enhance this process, we could explore various optimizations in how the LLM generates and assigns these variable names, as well as how we handle the storage and retrieval of the chunks.
I look forward to your thoughts on this approach and any suggestions you may have for improving it further!
What would make this work better is making it able to take advantage of diff (compare) techniques to produce some sort of sub-chunks, then keeping those available to reduce the discrepancy, and maybe also to optimize the generation... I hope this makes sense.
And as you stated here
One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source.
In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs.
This would indeed be optimal, as it would allow us to leverage the collective work to get the best results.
PS: I don't have a good machine right now to do some testing myself, nor an API key that would allow me to do it properly.
One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source.
@jehna Agreed. This was one of the ideas that first led me down the 'fingerprinting' path. Though instead of 'deterministically reversing the code to the original source' in its entirety (which may also be useful), my plan was first to be able to detect dependencies and mark them as such (as most of the time I don't care to look too deeply at them), and then secondly to just be able to extract the 'canonical variable/function names' from that original source and be able to apply them to my unminified version (similar to how humanify currently uses AI for this step); as that way I know that even if there is some little difference in the actual included code, I won't lose that by replacing it with the original source. These issues on wakaru are largely based on this area of things:
While it's a very minimal/naive attempt, and definitely not the most robust way to approach things, a while back I implemented a really basic 'file fingerprint' method, mostly to assist in figuring out when a chunk had been renamed (but was otherwise largely the same chunk as before), that I just pushed to poc-ast-tools (https://github.com/0xdevalias/poc-ast-tools/commit/b0ef60f8608385c40de2644b3346b1834eb477a0):
When I was implementing it, I was thinking about embeddings, but didn't want to have to send large files to the OpenAI embeddings API; and wanted a quick/simple local approximation of it.
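To give a flavour of the kind of 'quick local approximation' I mean (this is just an illustrative sketch, not the actual code from that commit): token shingles plus Jaccard similarity can stand in for embedding distance.

```js
// Very rough tokeniser: identifiers/keywords, numbers, or single punctuation chars.
function shingles(code, size = 5) {
  const tokens = code.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) || [];
  const set = new Set();
  for (let i = 0; i + size <= tokens.length; i++) {
    set.add(tokens.slice(i, i + size).join(' '));
  }
  return set;
}

// Jaccard similarity over the shingle sets: 1 means identical sets, 0 means disjoint.
function similarity(codeA, codeB) {
  const a = shingles(codeA);
  const b = shingles(codeB);
  let intersection = 0;
  for (const s of a) if (b.has(s)) intersection++;
  const union = a.size + b.size - intersection;
  return union === 0 ? 1 : intersection / union;
}
```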
Expanding on this concept to the more general code fingerprinting problem: I would probably look at breaking things down to at least an individual module level, as I believe modules usually tend to coincide with original source files; and maybe break things down even further to a function level if needed. I would also probably normalise the code first, to remove any function/variable identifiers and to remove the impact of whitespace differences/etc.
While it's not applied to generating a fingerprint, you can see how I've used some of these techniques in my approach to creating a 'diff minimiser' for identifying newly changed code between builds, while ignoring the 'minification noise / churn':
In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs.
@jehna Oh true.. yeah, that definitely makes sense. Kind of like a local cache.
One approach to enhance this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) in order of appearance. (I honestly don't know how this could be done, but maybe via RegEx or just by asking an LLM to do it.)
@neoOpus This would be handled by parsing the code into an AST, and then manipulating that AST to rename the variables.
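At its simplest, that looks something like this with Babel (minimal sketch):

```js
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;
const generate = require('@babel/generator').default;

// Parse to an AST, then let Babel rename a binding and every reference to it
// within the correct scope; unrelated identifiers that happen to share the
// same name elsewhere are left untouched.
function renameBinding(code, oldName, newName) {
  const ast = parser.parse(code, { sourceType: 'unambiguous' });
  traverse(ast, {
    Program(path) {
      if (path.scope.hasBinding(oldName)) {
        path.scope.rename(oldName, newName);
      }
    },
  });
  return generate(ast).code;
}

// e.g. renameBinding('const a = 1; console.log(a);', 'a', 'retryCount')
// renames both occurrences of `a` to `retryCount`.
```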
You can see various hacky PoC versions of this with various parsers in my poc-ast-tools repo (I don't remember which is the best/most canonical as I haven't looked at it all for ages), eg:
You can see some of the early hacky mapping attempts I was making in these files:
That was the point where I realised I really needed something more robust (such as a proper fingerprint that would survive code minification) to use as the key.
We can then reassign the previously generated variable names (instead of the original variable names or having to generate different ones) to the new code chunks by feeding them as choices for the LLM or assigning them directly programmatically to reduce the need to consume more tokens for the same chunks.
@neoOpus Re-applying the old variable names to the new code wouldn't need an LLM at all, as that part is handled in the AST processing code within humanify:
Don't let AI touch the code
Now while LLMs are very good at rephrasing and summarizing, they are not very good at coding (yet). They have inherent randomness, which makes them unsuitable for performing the actual renaming and modification of the code.
Fortunately renaming a Javascript variable within its scope is a solved problem with traditional tools like Babel. Babel first parses the code into an abstract syntax tree (AST, a machine representation of the code), which is easy to modify using well behaving algorithms.
This is much better than letting the LLM modify the code on a text level; it ensures that only very specific transformations are carried out so the code's functionality does not change after the renaming. The code is guaranteed to have the original functionality and to be runnable by the computer.
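i.e. once you have a saved name map, re-applying it is a purely mechanical AST transform. A rough sketch (illustrative only, not humanify's actual implementation):

```js
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;
const generate = require('@babel/generator').default;

// Apply a previously saved rename map (e.g. { a: 'retryCount', Wc: 'fetchUser' })
// without any LLM involvement. Note: keying purely on the minified name is naive,
// since the same short name can be reused in different scopes; hence the
// fingerprinting discussion for building a more robust key.
function applyRenames(code, renameMap) {
  const ast = parser.parse(code, { sourceType: 'unambiguous' });
  traverse(ast, {
    Scopable(path) {
      for (const [minified, humanName] of Object.entries(renameMap)) {
        if (path.scope.hasOwnBinding(minified)) {
          path.scope.rename(minified, humanName);
        }
      }
    },
  });
  return generate(ast).code;
}
```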
I would like to share an idea I’ve been considering, even though I’m still in the process of researching this topic. I hope it proves to be useful!
@neoOpus At a high level, it seems that the thinking/aspects you've outlined here are more or less in line with what I've discussed previously in the resources I linked to in my first comment above.
PS: I don't have a good machine right now to do some testing myself, nor an API key that allows me to do them properly.
@neoOpus IMO, the bulk of the 'harder parts' of implementing this aren't really LLM related, and shouldn't require a powerful machine. The areas I would suggest most looking into around this are how AST parsing/manipulation works; and then how to create a robust/stable fingerprinting method.
IMO, figuring out the ideal method of fingerprinting is probably the largest / potentially hardest 'unknown' in all of this currently (at least to me, since while I started to gather resources for it, I haven't had the time to deep dive into reading/analysing them all):
Off the top of my head, I would probably look at breaking things down to at least an individual module level, as I believe modules usually tend to coincide with original source files; and maybe break things down even further to a function level if needed; and then generate fingerprints for them.
I would also potentially consider looking at the module/function 'entry/exit' points (eg. imports/exports); or maybe even the entire 'shape' of the module import graph itself.
I would also probably be normalising the code to remove any function/variable identifiers and to remove the impact of whitespace differences/etc; before generating any fingerprints on it.
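A rough sketch of that normalisation + hashing step (illustrative only; a real implementation would also want to guard against collisions with names that already exist in the source, handle edge cases, etc):

```js
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;
const generate = require('@babel/generator').default;
const crypto = require('crypto');

// Rename every binding to a positional placeholder, regenerate the code with
// compact output (so whitespace/comments don't affect the result), then hash it.
function normalisedFingerprint(code) {
  const ast = parser.parse(code, { sourceType: 'unambiguous' });
  let counter = 0;
  const renamed = new Set();

  traverse(ast, {
    // Visit every scope-creating node and rename the bindings it owns
    Scopable(path) {
      for (const name of Object.keys(path.scope.bindings)) {
        const binding = path.scope.bindings[name];
        if (renamed.has(binding)) continue; // each binding only gets renamed once
        renamed.add(binding);
        path.scope.rename(name, `_p${counter++}`);
      }
    },
  });

  const normalised = generate(ast, { compact: true, comments: false }).code;
  return crypto.createHash('sha256').update(normalised).digest('hex');
}
```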
Another potential method I considered for the fingerprints is identifying the types of elements that tend to remain stable even when minified, and using those as part of the fingerprint; that is one of the manual methods I used to be able to identify a number of the modules listed here:
(Edit: I have captured my notes from this comment on the following gist for posterity: https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#issue-97-more-deterministic-renames-across-different-versions-of-the-same-code)
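As a rough illustration of that 'stable elements' idea (string literals and non-computed property keys usually survive minification, so a sorted hash of them can act as a secondary fingerprint for matching modules across builds):

```js
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;
const crypto = require('crypto');

function stableElementsFingerprint(code) {
  const ast = parser.parse(code, { sourceType: 'unambiguous' });
  const stable = [];
  traverse(ast, {
    StringLiteral(path) {
      stable.push(path.node.value); // string contents aren't touched by minifiers
    },
    ObjectProperty(path) {
      if (!path.node.computed && path.node.key.type === 'Identifier') {
        stable.push(path.node.key.name); // property keys aren't mangled by default
      }
    },
  });
  return crypto.createHash('sha256').update(stable.sort().join('\u0000')).digest('hex');
}
```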
Resume-ability would also be a good thing to consider.
Some of the discussion in the following issue could tangentially relate to resumability (specifically if a consistent 'map' of renames was created, perhaps that could also show which sections of the code hadn't yet been processed):
Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/167#issuecomment-2425538385
Hi,
I have an idea that I hope will be helpful and prompt some discussion.
Currently, LLMs often guess variable names differently across various versions of the same JavaScript code. This inconsistency complicates versioning, tracking changes, and merging code for anyone regularly analyzing or modifying applications, extensions, etc.
My suggestion is to create a mapping file that lists generated variable names alongside their LLM-generated alternatives, updated continuously. This would serve as a lookup table for the LLM, helping maintain consistency and reducing variations in the final output. Admittedly, I haven't fully explored the feasibility of this concept, but I believe it would strengthen reverse-engineering processes.
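For illustration, an entry in such a mapping file might look something like this (the format here is entirely hypothetical):

```json
{
  "<fingerprint of the normalised chunk>": {
    "renames": {
      "a": "retryCount",
      "b": "maxRetries",
      "Wc": "fetchUserProfile"
    }
  }
}
```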