jbaublitz / neli

Rust type safe netlink library
BSD 3-Clause "New" or "Revised" License
180 stars 35 forks source link

Neli panics in spawn_processing_thread whenever it receives an Audit subsystem message with no nlmsghdr. #226

Open camconn opened 1 year ago

camconn commented 1 year ago

Version of neli v0.7.0-rc2

Does the documentation specify that neil's behavior is correct? No, but it's not Neli's fault. I'm using the Audit Netlink family which is well-known to be inconsistent with other Netlink families. This issue is unique to Audit which sends groups of audit messages with a struct nlmsghdr in front followed by strings with no additional nlmsghdrs.

Does there appear to be a mistake in the netlink documentation based on the kernel code? No. The Linux Audit subsystem is poorly documented and inconsistent with other Netlink famillies. Other libraries/programs have to implement their own parsing (like go-libaudit and auditd) because Audit isn't consistent with its siblings.

Describe the bug Neli incorrectly handles Audit messages and treats them as additional Netlink headers . This causes neli to pull out outrageous message lengths like 808464509 from an ASCII audit message.

To Reproduce Create a new cargo project:

cargo new --bin repro

Then in Cargo.toml:

[package]
name = "repro"
version = "0.1.0"
edition = "2021"

[dependencies]
log = "0.4.20"
neli = { version = "0.7.0-rc2", features = ["async", "sync"] }
simple_logger = "4.2.0"
tokio = { version = "1.32.0", features = ["time"] }

And in main.rs:


use std::vec::Vec;

use log::{error, info};
use neli::consts::nl::{NlType, NlmF};
use neli::consts::socket::NlFamily;
use neli::err::RouterError;
use neli::nl::NlPayload;
use neli::utils::Groups;
use tokio;

#[repr(C)]
#[derive(Default,Debug,PartialEq,Eq,Clone)]
pub struct AuditStatus {
    mask: u32,
    enabled: u32,
    failure: u32,
    pid: u32,
    rate_limit: u32, 
    backlog_limit: u32,
    lost: u32,
    backlog: u32,
    feature_bitmap: u32,
    backlog_wait_time: u32,
    backlog_wait_time_actual: u32,
}

#[neli::neli_enum(serialized_type = "u16")]
pub enum AuditType {
    GetStatus = 1000u16,
    SetStatus = 1001u16,
    ConfigChange = 1305u16,
}
impl NlType for AuditType {}

fn main() {
    simple_logger::init_with_level(log::Level::Trace).unwrap();
    let runtime = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();
    let _ = runtime.block_on(panics_on_audit_message())
        .unwrap();
}

type PL = Vec<u8>;

async fn panics_on_audit_message() -> Result<(), String> {
    let bundle = neli::router::asynchronous::NlRouter::connect(
        NlFamily::Audit,
        None,
        Groups::empty()
    ).await.map_err(|_| "Whoops".to_string())?;
    let router = bundle.0;

    let mut enable_status = AuditStatus::default();
    enable_status.mask = 0x00001 | 0x00004;
    enable_status.enabled = 1;
    enable_status.pid = std::process::id();

    let mut as_bytes: PL = Vec::new();
    unsafe {
        let raw_bytes = core::slice::from_raw_parts(
            (&enable_status as *const AuditStatus) as *const u8,
            core::mem::size_of::<AuditStatus>()
        );
        as_bytes.extend_from_slice(raw_bytes);
    }
    assert_eq!(as_bytes.len(), core::mem::size_of::<AuditStatus>(), "Serialized struct should match struct size");
    let payload: NlPayload<AuditType,PL> = NlPayload::Payload(as_bytes);

    let mut recv_handle = router.send::<AuditType,PL,AuditType,PL>(AuditType::SetStatus, NlmF::REQUEST, payload).await
        .expect("Should send");

    match recv_handle.next::<AuditType,PL>().await {
        Some(recvd) => {
            let msg = match recvd {
                Ok(msg) => msg,
                Err(e) => match e {
                    RouterError::BadSeqOrPid(msg) => msg, // This is necessary because the Audit subsystem ignores PIDs sometimes
                    _ => return Err("Unable to get message back".to_string()),
                }
            };
            info!("got msg {:?}", msg);
            info!("recv type {:?}", msg.nl_type());
            if let NlPayload::Payload(bytes) = msg.nl_payload() {
                info!("got payload of bytes {:?}", bytes);
            } else {
                error!("got other payload: {:?}", msg.nl_payload());
            }

            // Now wait a bit and we will see a panic in the router processing thread:
            tokio::time::sleep(std::time::Duration::from_secs(3)).await;

            Ok(())
        }
        None => Err("Should have received body".to_string())
    }
}

If you have auditd on your system, disable it with sudo auditctl -e 0. You will also need to run this command after you run the reproduction example. The reproduction only triggers a panic with Neli when the audit subsystem goes Off -> On.

To see the panic, run:

cargo b && sudo RUST_BACKTRACE=full ./target/debug/repro

or alternatively, if you don't want to use sudo directly on the executable:

sudo setcap CAP_AUDIT_CONTROL=+ep ./target/debug/repro && ./target/debug/repro

Here is a backtrace and the output of the program: panic.txt repro.txt

I have seen this behavior on both kernels 5.15.0-83 on Ubuntu and 6.1.41 on Gentoo on x86_64.

Expected behavior Don't panic when receiving an audit message, and provide an escape hatch to retrieve "leftover" bytes in a neli buffer whenever receiving a message.

Additionally, because the audit message group does not have the same behavior as other netlink families, I would like a way as a user to manually parse the "leftover" messages trailing the initial nlmsghdr struct.

Additional context The audit family has behavior inconsistent with the rest of Netlink. See commentary above.

Additional incentive When this is fixed I will contribute my code for the Audit family for Neli (see #96) which is currently broken by this bug.

jbaublitz commented 1 year ago

Hi @camconn! I would be happy to work on this with you. I'll have some follow up questions likely this weekend when I have a minute to look at this in more depth.

jbaublitz commented 1 year ago

@camconn Okay so mainly what I was wondering is whether there's any identifying information about the packets received after the initial one with the nlmsghdr. If you can point me to any docs you're aware of on the audit subsystem that would be great. I read through the linked article that I'd actually seen before when initially thinking about implementing the audit subsystem, and that seems to highlight a few areas I'll have to add exceptions for in the audit case.

This is also bringing back up the idea I had a while ago of having a quirks.rs file to segment off all of the "bad" netlink behavior in one spot.

camconn commented 1 year ago

@jbaublitz I did a little research reading Linux's documentation and the sources of kernel/audit*.c and here's what I found:

  1. There are no docs or specs on how audit messages are sent over the wire. The closest thing is the source code and this repo which documents the format of messages (not useful in this case).
  2. The reported packet length in the header can omit the length of the header itself.
  3. There is no explicit separator in the the Netlink packets between the header and the audit message.
  4. There is a ~standard~ typical structure for audit event/record packets: a. Audit events/records always have a type in the range [1300, 2999]. b. The entire audit event is sent in a single packet (struct sk_buff) to userland. c. Everything after the Netlink header (first 16 bytes or struct nlmsghdr) of an audit packet with a type in [1300, 2999] is the audit message.

Hope this helps.

jbaublitz commented 1 year ago

@camconn So if I'm understanding you correctly, I think what's probably happening is that the omission of the length of the header is the cause of the panics. That's my best guess knowing how the internals work. Theoretically, all other aspects of what you described should work if you implement a data structure that implements ToBytes/FromBytes for the audit format. Does that sound plausible to you?

camconn commented 1 year ago

@jbaublitz

the omission of the length of the header is the cause of the panics.

That's my impression of what's causing the panic. The tail end of the packet is getting interpreted as an additional header and the bogus length causes an OOB index triggering the panic.

Is there anything a user can do to work around this or would this require a patch to Neli?

jbaublitz commented 1 year ago

I think this would require a patch to neli. I'm not able to get around this until after the 25th, but I will work on it as soon as I can after that.