quil-lang / quilc

The optimizing Quil compiler.
Apache License 2.0
452 stars 73 forks source link

Refactor `COMPRESS-INSTRUCTIONS` #830

Closed markasoftware closed 1 year ago

markasoftware commented 2 years ago

Passes all tests, and produces output just as good as the original implementation on some basic benchmarks I ran.

(commits should probably be squashed)

markasoftware commented 2 years ago

While the output is just as short as the output from the original COMPRESS-INSTRUCTIONS, my implementation is not exactly functionally identical. The main difference is that the original requires all queues to correspond to actual hardware objects, and has one "global" queue which does not correspond to a hardware object. My implementation allows each queue to use an arbitrary set of resources, not necessarily corresponding to a real hardware object, and does not have a concept of a "global" queue. This means my implementation can have multiple 3-qubit queues simultaneously, for example.

If these details of the original implementation are important for some reason (maybe speed in certain situations?), adding them to my branch will be easy.

braised-babbage commented 2 years ago

This is well written code, and notably shorter than what it replaces, so to the extent that it behaves similarly (as a functional component of quilc) to what we had before, I'm happy to see it merged.

However, I am a bit hesitant to encourage piecemeal changes at this stage, in the sense that I don't think we have an entirely clear perspective on the compressor, i.e. what we want from it, what we expect, what sort of compromises we are making, and so forth. I have some limited perspective, of course: I know that we want peephole optimization to clean up after the translation to native code, I know that we additionally support rather generic "decompilation" for certain arity (e.g. take a 1Q sequence to ZXZ), and so forth. But to the extent that the compressor represents a kind of heuristic solution to some well posed but intractable problem, it might be valuable to articulate that problem a bit more explicitly. In other words, I think this PR looks good, but (to me) it's most exciting when it is in the context of a larger renewal of thinking on how quilc does its job.

Along similar lines, I wonder the extent to which we can or should think of compression as simply being a one-size fits all mechanism "parameterized" by the chip specification and defined compilers. As you note the original implementation makes a conscious decision to associate queues to hardware objects. I don't take this by itself as a good or bad decision, but I do take it as indicative that the original design envisioned a certain relationship between our understanding of "compression" and the device that we are compiling to.

One can imagine quite disparate architectures

I suspect (but might be wrong) that many of the comparisons made in comparing this PR against the old compressor are with respect to the first item in that list, because this is mostly what we have tested & benchmarked quilc on. It is in the spirit of quilc that many of the difference between the aforementioned architectures can come simply down to the available compilers on the chip and how they are exposed via hardware objects. On the other hand, it is not obvious to me that compressing well is quite as simple as that.

markasoftware commented 2 years ago

The purpose of this PR is not to change the paradigm of compression in quilc from chip-specific to generic. In fact, my goal is not to change the behavior of COMPRESS-INSTRUCTIONS at all. I only changed the behavior when I believed the original behavior to be more a quirk of the implementation rather than an intentional, well-thought-out choice. (The old implementation creates all the state machines/queues that it needs right at the start. If your implementation is going to precreate all its queues, then hardware objects are a reasonable way to pick those queues. But in an implementation like my own where the queues are created on-demand, there is no reason to limit yourself to hardware objects afaik).

A more philosophical discussion about the purpose of the compressor is certainly a good idea before I start really messing with it in later PRs.

braised-babbage commented 2 years ago

The purpose of this PR is not to change the paradigm of compression in quilc from chip-specific to generic. In fact, my goal is not to change the behavior of COMPRESS-INSTRUCTIONS at all. I only changed the behavior when I believed the original behavior to be more a quirk of the implementation rather than an intentional, well-thought-out choice. (The old implementation creates all the state machines/queues that it needs right at the start. If your implementation is going to precreate all its queues, then hardware objects are a reasonable way to pick those queues. But in an implementation like my own where the queues are created on-demand, there is no reason to limit yourself to hardware objects afaik).

A more philosophical discussion about the purpose of the compressor is certainly a good idea before I start really messing with it in later PRs.

That's very fair, and I didn't intend to impute any additional meaning to these changes than what you had written in your earlier comment. The second half of my remark was more just me running down a tangent. Regardless, I think this PR is good!

markasoftware commented 2 years ago

Thanks for the feedback @ecpeterson.

I've been looking a little bit more into the speed of my implementation. Without a global queue, it does tend to create a greater number of queues than the original implementation, and is typically about 10% slower on circuits involving more than 2 qubits. I'd assume the extra flexibility could lead to better compression in some cases, so this isn't a problem IMO.

However, when increasing *global-queue-tolerance-threshold* to be greater than 3, the speed gets worse. On some circuits I tried, the performance was 50% worse than the original when *global-queue-tolerance-threshold* is set to 4. That's because if we start with size 4 queues, then my implementation creates size 3 queues next, then size 2, and finally size 1. On the other hand, the original implementation would create a size 4 global queue, but then immediately drop down to size 2 queues, then size 1.

I could use some guidance on whether there's any reason to compress with an "intermediate" sized queue (ie, a queue with size strictly between 2 and *global-queue-tolerance-threshold*). For example, I'm skeptical that compression on a size-3 queue will make additional progress on a sequence of instructions that was already compressed as part of a size-4 queue, since no new compilers are introduced until we go down to size 2.

In the examples I've been playing around with, my implementation doesn't compress noticeably better than the original. But maybe there are some situations where it could.

It's very easy for me to replicate the original behavior here if desired and jump straight down to size 2 queues during recursion. In fact, I have already been playing around with this, and it performs almost identically to the original in terms of both output quality and speed.

markasoftware commented 2 years ago

I just pushed a change to restrict the recursive queue sizes to 2 as discussed in my previous comment.

IMO the PR is now ready to merge.

ecpeterson commented 1 year ago

Sorry for forgetting to reply to your query! I'm not surprised about the speed degradation for "wide" queues, but I also don't think it's worth fretting about. You can jump to 2 or not bother. I think it's exceedingly rare that traversing all the intermediate queue sizes will yield an improvement — even if quilc were capable of optimal nQ compilation, which it is not. On the other hand, jumping to 2 is kind of a weird quirk that may spook some future maintainer. 🤷‍♀️

Merge it either way imo.

stylewarning commented 1 year ago

Sorry for forgetting to reply to your query! I'm not surprised about the speed degradation for "wide" queues, but I also don't think it's worth fretting about. You can jump to 2 or not bother. I think it's exceedingly rare that traversing all the intermediate queue sizes will yield an improvement — even if quilc were capable of optimal nQ compilation, which it is not. On the other hand, jumping to 2 is kind of a weird quirk that may spook some future maintainer. 🤷‍♀️

Merge it either way imo.

I'm going to merge this, but @markasoftware could you add a comment to the source code with some of these sentiments so as to not spook future maintainers, and make a separate PR?