net4people / bbs

Forum for discussing Internet censorship circumvention

About GFW's heart bleeding attack: Why was it fixed? #340


bsod-qso commented 9 months ago

I want to have a discussion about Bleeding Wall: A Hematologic Examination on the Great Firewall.

In the paper they mention that the heart-bleeding bug was fixed without any prior public disclosure. It's obvious that the authors did not submit it to CNCERT.

Then I have a question: How did this bug get discovered by the GFW's devops people? I think brainstorming about it can be helpful for GFW analyzers, so we can discover more bugs in its implementation. Considering that the GFW must be connected to the network (since they need to inject polluted packets), maybe we can even have a chance to execve() on the GFW. Ambition is always good!

I have several guesses, sorted by likelihood:

Another bug

The GFW requires high-speed packet processing, so they may not be using the kernel's network stack (according to the paper, it's safe to guess they are running Linux with a kernel major version of 3 or above). If I were using DPDK for DNS sniffing & hijacking, I would just put the raw packet data in a memory buffer and call some function like this:

status_t process_packet(uint8_t* buf, size_t buflen) {
  // ...
  if ((GET_IP_HEADER_TYPE(buf) == IP_HEADER_TYPE_UDP) && (GET_UDP_TARGET_PORT(buf) == 53)) {
    return process_dns_packet(buf, buflen);
  }
}

status_t process_dns_packet(uint8_t *buf, size_t buflen) {
  // ... (the elided parsing sets up dns_buf / dns_buf_len pointing at the QNAME:
  //      dns_buf[0] is the first label's length byte, &dns_buf[1] is the label data)
  uint32_t fake_ip = 0;
  if (query_dns_hijack_table(&dns_buf[1], dns_buf[0])) {
    send_dns_fake(dns_buf, dns_buf_len);
  }
  // ...
}

void send_dns_fake(uint8_t *buf, size_t buflen) {
  // ...
  size_t dns_buf_len = DNS_BASE_LENGTH + buflen;
  uint8_t *dns_buf = malloc(dns_buf_len);
  memcpy(dns_buf, DNS_BASE, DNS_BASE_LENGTH);
  memcpy(dns_buf + DNS_BASE_LENGTH, buf, buflen);
  // ...
  udp_send(dns_buf, dns_buf_len);
  // ...
}

This is enough to produce the bug described in the paper.

According to the paper, the authors found that:

Format 2: Overflow the first label. ...Besides, this format had a higher success rate...

If the bug were simply an overflow in a memcpy or an out-of-bounds read, the attack should succeed almost every time, because the most data that can be read back is 0xFF bytes, far smaller than the page size of any modern OS / memory manager. Also, if the variable is on the stack, an extra 0xFF bytes is still far from the edge of the stack mapping. So I don't think the failures were caused by the buggy code itself: the size is too small to ever trigger a paging issue.
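
As a sanity check on that claim, here is a back-of-the-envelope sketch (my own arithmetic, assuming 4 KiB pages and a uniformly placed buffer end; nothing here is measured from the GFW):

#include <stdio.h>

int main(void) {
    const double page_size    = 4096.0;  /* typical x86-64 page size */
    const double max_overread = 0xff;    /* at most one DNS label, per the paper */
    /* An over-read this small only touches the next page when the buffer
       happens to end within 0xff bytes of a page boundary... */
    printf("worst-case chance of crossing a page boundary: %.1f%%\n",
           100.0 * max_overread / page_size);  /* about 6.2% */
    /* ...and even then the adjacent page is usually mapped, so a fault
       from the read itself should be rare. */
    return 0;
}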

Then I remembered that the GFW is a large-scale surveillance & intrusion system. Thus, the GFW would need some telemetry, so they can collect data and compute statistics later, maybe using Hadoop, Spark or Hive. So the actual process_dns_packet might be:

status_t process_dns_packet(uint8_t *buf, size_t buflen) {
  // ... (dns_buf / dns_buf_len set up from the packet as before)
  uint32_t fake_ip = 0;
  if (query_dns_hijack_table(&dns_buf[1], dns_buf[0])) {
    send_dns_telemetry(dns_buf, TELEMETRY_DNS_HIJACK_HIT);
    send_dns_fake(dns_buf, dns_buf_len);
  }
  // ...
}

void send_dns_telemetry(uint8_t* buf, const uint8_t event) {
  // always copies a fixed 0xff bytes, regardless of how long the name really is
  memcpy(telemetry_data_ring_buffer_curr, buf, 0xff);
  *telemetry_event_ring_buffer_curr = event;
}

And the ring buffer must be processed later. To feed it into Hive, I would treat the name as a string, so I would write code like this:

void dns_telemetry_send_to_hive(uint8_t* buf, const uint8_t event) {
  // ...
  size_t len = strlen((const char *)&buf[1]);  // allocation sized by the NUL-terminated string
  dns_hive_report_struct *item = malloc(sizeof(dns_hive_report_struct));
  item->value = malloc(len);
  memcpy(item->value, &buf[1], buf[0]); // <--- HEAP OVERFLOW: copy length is the label-length byte
  // ...
}

So when the authors of the paper say an attack failed, there was actually a heap overflow, which would almost certainly be caught by the memory manager and cause a core dump.
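
To make this concrete, here is a standalone sketch of that allocation/copy mismatch (the record layout and sizes are my own assumptions, not anything from the paper):

#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Hypothetical telemetry record: the first byte is the label-length byte
       taken from the query (0x40 here), but only four bytes of label data
       follow before a NUL -- the "overflow the first label" shape. */
    unsigned char record[0x100] = {0x40, 't', 'e', 's', 't', 0x00};

    size_t len = strlen((const char *)&record[1]);  /* 4 */
    unsigned char *value = malloc(len);             /* 4-byte allocation */
    memcpy(value, &record[1], record[0]);           /* 0x40-byte copy: heap overflow */

    /* Depending on the allocator and its hardening, the clobbered chunk
       metadata is typically detected on a later free()/malloc() and the
       process aborts (SIGABRT -> core dump), or it simply crashes when the
       corrupted metadata gets used. */
    free(value);
    free(malloc(64));
    return 0;
}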

The GFW developers would notice this unusual rise in core dumps, pick one up, analyze it, and fix the bug.

Info leak

The author(s) might have shared info about this bug as something interesting in a public or private chat group, and the message was accidentally discovered by the Chinese government. I think discussing bugs in friendly chats is common, but being noticed by the government is not a good sign. Maybe the authors can tell us whether they shared the message somewhere earlier?

Code audit

Maybe they use some tools (like ASan for dynamic analysis, or Coverity as a rule checker), but I doubt it. That requires extra engineering work around the GFW's code (I mean an SDL, a development life cycle and so on), and I doubt they have it. They may not even have CI, and instead rely on some graduate student logging into a certain machine via ssh, running make manually to build it, and distributing the result via rsync.

Traffic replaying

Maybe they hired some contractors. They would replay a sampled fraction of the traffic going through the firewall and feed it into some human analysis platform, or into another implementation of the GFW, so they can find new kinds of traffic. But I doubt they really have the resources and the willingness to do this. It isn't common unless your network is really small and the matter is taken seriously.

klzgrad commented 8 months ago

It's obvious that the authors did not submit it to CNCERT.

It is as good as a submission, because either way we have observed how the fix was deployed.

Maybe they use some tools (like ASan for dynamic analysis

You don't run ASan in production. Also, this class of bugs needs fuzzing to be detected effectively; without fuzzing, it takes Linux-kernel-level code review, which I presume they don't have.
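
For illustration, a minimal libFuzzer-style harness around the hypothetical process_dns_packet() sketched earlier in this thread might look like the following (process_dns_packet and status_t come from that sketch, not from any real GFW code):

#include <stdint.h>
#include <string.h>

typedef int status_t;  /* stand-in for the status_t type used in the sketch */
extern status_t process_dns_packet(uint8_t *buf, size_t buflen);

/* Build with something like:
   clang -g -O1 -fsanitize=fuzzer,address harness.c dns_parser.c
   (dns_parser.c being whatever file holds the code under test). */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size == 0 || size > 512)   /* stay within typical UDP DNS payload sizes */
        return 0;
    uint8_t copy[512];
    memcpy(copy, data, size);      /* writable copy, since the parser may modify it */
    process_dns_packet(copy, size);
    return 0;
}

With ASan in the loop, a strlen-sized allocation copied with a label-length byte, as hypothesized above, would show up as a heap-buffer-overflow almost immediately.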

They may not even have CI, and instead rely on some graduate student logging

I think they do have CI, and CD, as the data in the paper shows.

maybe we can even have a chance to execve()

But you get only a few bytes back, and can't do anything even close to execve. I propose fuzzing other areas of the GFW, e.g. the TLS ClientHello parsers, and seeing if anything interesting happens. No, there is no data forgery in the response.

Edit: Also, I saw reports that one type of query from the paper can still reproduce the data leak after publication.

bsod-qso commented 8 months ago

Oh, the whole Code audit paragraph is about offline dynamic/static analysis, not about running anything on the production servers. I doubt they have the guts for it.

And my guess that they don't have CI/CD is based on Figure 9: the collection rate drops too fast, with weird spikes and differing shapes. With an automatic deployment pipeline, the drop-off usually falls into a few recognizable patterns.

As for execve: well, like I said, ambition is always good!