mediar-ai / screenpipe

24/7 local AI screen & mic recording. Works with Ollama. Llama3.2 control your computer. Alternative to Rewind.ai & Zapier. Open. Secure. You own your data. Rust.
https://screenpi.pe
MIT License
8.13k stars 448 forks

Should we compress screenshots? #8

Closed benjaminshafii closed 3 months ago

benjaminshafii commented 3 months ago

tl;dr

I started experimenting with downscaling images. If the purpose is to pipe content to an LLM, I think it would be beneficial to reduce the number of input tokens we send (the LLM doesn't care whether an image looks good, so we should be able to make some interesting tradeoffs). At the moment I feel this implementation is not the way to go, but I thought I'd share it here to facilitate future work on this.

More Info

Screenpipe generates screenshots of roughly 10 MB each.

I tried modifying the code to add compression (see full source below).

The problem is that it puts a lot of strain on the CPU, and it's really slow: around 2-3 s per screenshot. There are probably ways to optimize it, though.
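For intuition on why downscaling helps both the CPU and the token budget: pixel count falls with the square of the downscale factor, so a 2x downscale leaves only a quarter of the pixels to encode. A back-of-the-envelope sketch (the 3456x2234 resolution is just an illustrative retina-display example, not a measured value):

```rust
/// Pixels remaining after an integer downscale, mirroring the
/// `width / downscale` math used in the prototype below.
fn downscaled_pixels(width: u64, height: u64, factor: u64) -> u64 {
    // Clamp to 1 so a factor of 0 doesn't divide by zero.
    let f = factor.max(1);
    (width / f) * (height / f)
}
```

At 2x the pixel count drops to ~25% of the original; at 4x, to ~6% — so encode time and any pixel-proportional token cost shrink quadratically.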


I ran some prototypes with 2x downscaling + JPEG conversion in screenpipe.

Naive implementation:

use chrono::Local;
use clap::Parser;
use crossbeam::channel;
use image::{ImageBuffer, ImageEncoder, DynamicImage, imageops::FilterType, ColorType};
use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Cursor, Write};
use std::path::Path;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, sleep};
use std::time::{Duration, Instant};
use xcap::Monitor;
use rayon::prelude::*; // Add rayon for parallel processing

const DISPLAY: &str = r"
      ___         ___         ___         ___         ___         ___                   ___                 ___      ___     
     /  /\       /  /\       /  /\       /  /\       /  /\       /__/\                 /  /\    ___        /  /\    /  /\    
    /  /:/_     /  /:/      /  /::\     /  /:/_     /  /:/_      \  \:\               /  /::\  /  /\      /  /::\  /  /:/_   
   /  /:/ /\   /  /:/      /  /:/\:\   /  /:/ /\   /  /:/ /\      \  \:\             /  /:/\:\/  /:/     /  /:/\:\/  /:/ /\  
  /  /:/ /::\ /  /:/  ___ /  /:/~/:/  /  /:/ /:/_ /  /:/ /:/_ _____\__\:\           /  /:/~/:/__/::\    /  /:/~/:/  /:/ /:/_ 
 /__/:/ /:/\:/__/:/  /  //__/:/ /:/__/__/:/ /:/ //__/:/ /:/ //__/::::::::\         /__/:/ /:/\__\/\:\__/__/:/ /:/__/:/ /:/ /\
 \  \:\/:/~/:\  \:\ /  /:\  \:\/:::::\  \:\/:/ /:\  \:\/:/ /:\  \:\~~\~~\/         \  \:\/:/    \  \:\/\  \  \:\/:/\  \:\/:/ /:/
  \  \::/ /:/ \  \:\  /:/ \  \::/~~~~ \  \::/ /:/ \  \::/ /:/ \  \:\  ~~~           \  \::/      \__\::/\  \::/  \  \::/ /:/ 
   \__\/ /:/   \  \:\/:/   \  \:\      \  \:\/:/   \  \:\/:/   \  \:\                \  \:\      /__/:/  \  \:\   \  \:\/:/  
     /__/:/     \  \::/     \  \:\      \  \::/     \  \::/     \  \:\                \  \:\     \__\/    \  \:\   \  \:\/:/  
     \__\/       \__\/       \__\/       \__\/       \__\/       \__\/                 \__\/               \__\/    \__\/    

";

#[derive(Parser)]
#[command(name = "screenpipe")]
#[command(about = "A tool to capture screenshots at regular intervals", long_about = None)]
struct Cli {
    /// Path to save screenshots
    #[arg(short, long, default_value = "target/screenshots")]
    path: String,

    /// Interval in seconds between screenshots (can be float, by default no delay)
    #[arg(short, long, default_value_t = 0.0)]
    interval: f32,

    /// Downscale factor (e.g., 2 means half the original size)
    #[arg(short, long, default_value_t = 1)]
    downscale: u32,

    /// Convert to grayscale
    #[arg(short, long)]
    grayscale: bool,

    /// Compress the output image
    #[arg(short, long)]
    compress: bool,
}

fn normalized(filename: &str) -> String {
    filename
        .replace('|', "")
        .replace('\\', "")
        .replace(':', "")
        .replace('/', "")
}

fn process_image(
    monitor: &Monitor,
    downscale: u32,
    frame_count: u32,
    sub_dir: &str,
    compress: bool,
) -> (Vec<u8>, String) {
    // Start timing the entire process
    let total_start = Instant::now();

    // Capture the image from the monitor
    let capture_start = Instant::now();
    let xcap_image = monitor.capture_image().unwrap();
    let capture_duration = capture_start.elapsed();
    println!("Image capture took: {:?}", capture_duration);

    // Get image dimensions
    let width = xcap_image.width() as u32;
    let height = xcap_image.height() as u32;

    // Convert raw image data to DynamicImage
    let conversion_start = Instant::now();
    let mut image: DynamicImage = ImageBuffer::from_raw(width, height, xcap_image.into_raw())
        .map(DynamicImage::ImageRgba8)
        .unwrap();
    let conversion_duration = conversion_start.elapsed();
    println!("Image conversion took: {:?}", conversion_duration);

    // Downscale the image
    let downscale = downscale.max(1);
    let new_width = width / downscale;
    let new_height = height / downscale;
    let resize_start = Instant::now();
    image = image.resize_exact(new_width, new_height, FilterType::Nearest);
    let resize_duration = resize_start.elapsed();
    println!("Image resize took: {:?}", resize_duration);

    // Convert to RGB
    let rgb_conversion_start = Instant::now();
    let rgb_image = image.to_rgb8();
    let rgb_conversion_duration = rgb_conversion_start.elapsed();
    println!("RGB conversion took: {:?}", rgb_conversion_duration);

    // Compress the image
    let compress_start = Instant::now();
    let mut jpg_data = Vec::new();
    let mut cursor = Cursor::new(&mut jpg_data);
    let quality = if compress { 70 } else { 100 };
    image::codecs::jpeg::JpegEncoder::new_with_quality(&mut cursor, quality)
        .write_image(
            rgb_image.as_raw(),
            new_width,
            new_height,
            ColorType::Rgb8.into(),
        )
        .unwrap();
    let compress_duration = compress_start.elapsed();
    println!("Image compression took: {:?}", compress_duration);

    // Generate the filename
    let filename = format!(
        "{}/monitor-{}-{}.jpg",
        sub_dir,
        normalized(monitor.name()),
        frame_count
    );

    // Total duration
    let total_duration = total_start.elapsed();
    println!("Total image processing took: {:?}", total_duration);

    (jpg_data, filename)
}

fn screenpipe(cli: &Cli, running: Arc<AtomicBool>) {
    if !Path::new(&cli.path).exists() {
        create_dir_all(&cli.path).unwrap();
    }

    let monitors = Monitor::all().unwrap();
    let mut frame_count = 0;

    println!("Found {} monitors", monitors.len());
    println!("Screenshots will be saved to {}", cli.path);
    println!("Interval: {} seconds", cli.interval);
    println!("Press Ctrl+C to stop");
    println!("{}", DISPLAY);

    let (tx, rx) = channel::bounded::<(Vec<u8>, String)>(monitors.len() * 2);

    let save_thread = thread::spawn(move || {
        while let Ok((image_data, filename)) = rx.recv() {
            let file = File::create(&filename).unwrap();
            let mut writer = BufWriter::new(file);
            writer.write_all(&image_data).unwrap();
        }
    });

    while running.load(Ordering::Relaxed) {
        let start_time = Instant::now();

        let day_dir = format!("{}/{}", cli.path, Local::now().format("%Y-%m-%d"));
        create_dir_all(&day_dir).unwrap();

        let sub_dir = format!("{}/{}", day_dir, frame_count / 60);
        create_dir_all(&sub_dir).unwrap();

        monitors.par_iter().for_each(|monitor| {
            let (image_data, filename) = process_image(
                monitor,
                cli.downscale,
                frame_count,
                &sub_dir,
                cli.compress,
            );
            tx.send((image_data, filename)).unwrap();
        });

        println!("Captured screens. Frame: {}", frame_count);

        let elapsed = start_time.elapsed();
        if elapsed < Duration::from_secs_f32(cli.interval) {
            sleep(Duration::from_secs_f32(cli.interval) - elapsed);
        }

        frame_count += 1;
    }

    drop(tx);
    save_thread.join().unwrap();
}

fn main() {
    let cli = Cli::parse();
    let running = Arc::new(AtomicBool::new(true));
    let r = running.clone();

    ctrlc::set_handler(move || {
        r.store(false, Ordering::Relaxed);
    })
    .expect("Error setting Ctrl-C handler");

    screenpipe(&cli, running);
}
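The capture/save split above (rayon workers produce encoded frames, a single thread persists them) can be sketched with std-only types as well. This is a simplified stand-in that collects results into memory instead of writing files, just to show the shape of the pattern:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the (jpg_data, filename) pairs produced by
// the capture workers in the prototype.
fn run_pipeline(frames: Vec<(Vec<u8>, String)>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<(Vec<u8>, String)>();

    // Single consumer, like `save_thread`: it drains the channel until
    // every sender has been dropped.
    let saver = thread::spawn(move || {
        let mut saved = Vec::new();
        while let Ok((data, name)) = rx.recv() {
            // A real implementation would write `data` to disk here.
            saved.push(format!("{name} ({} bytes)", data.len()));
        }
        saved
    });

    for frame in frames {
        tx.send(frame).unwrap();
    }
    drop(tx); // closing the channel lets the saver loop terminate
    saver.join().unwrap()
}
```

Dropping the last sender is what ends the consumer loop, which is why the prototype calls `drop(tx)` before joining the save thread.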
louis030195 commented 3 months ago

according to claude:

Example: Screen Recording 24/7 in Different Formats

Assumptions

MP4 (Video)

JPEG (Image)

PNG (Image)

Summary

Recommendation

For continuous screen recording, MP4 is the most efficient in terms of storage, balancing quality and file size.
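To make the comparison concrete, here is a rough storage model. All numbers are illustrative assumptions (about 200 KB per compressed JPEG, one frame every 10 s, and ~100 kbps average for H.264 on mostly-static screen content), not measurements:

```rust
const SECONDS_PER_DAY: u64 = 24 * 60 * 60;

/// GB/day for still-image capture: frames of an assumed average size,
/// spaced `interval_s` seconds apart.
fn jpeg_gb_per_day(bytes_per_frame: u64, interval_s: u64) -> f64 {
    (SECONDS_PER_DAY / interval_s) as f64 * bytes_per_frame as f64 / 1e9
}

/// GB/day for continuous video at an assumed average bitrate (bits/s).
fn video_gb_per_day(bitrate_bps: u64) -> f64 {
    SECONDS_PER_DAY as f64 * bitrate_bps as f64 / 8.0 / 1e9
}
```

Under these assumptions the JPEG stream costs about 1.73 GB/day versus about 1.08 GB/day for the mp4; the crossover depends entirely on how well inter-frame compression exploits a mostly-static screen.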

louis030195 commented 3 months ago

https://github.com/nashaofu/xcap/issues/137

louis030195 commented 3 months ago

rewind:

louis030195 commented 3 months ago

my idea is to make screenpi.pe usable both:

in terms of storage

and in terms of compute

say, post-processing or compression could also be done in the cloud to save local compute, at the cost of added network load

louis030195 commented 3 months ago

@ashgansh

I started experimenting with downscaling images. If the purpose is to pipe content to an LLM, I think it would be beneficial to reduce the number of input tokens we send (the LLM doesn't care whether an image looks good, so we should be able to make some interesting tradeoffs). At the moment I feel this implementation is not the way to go, but I thought I'd share it here to facilitate future work on this.

yeah, i think right now, running llama3 at 100% 24/7, my mac catches fire

probably again a hybrid approach: smaller models combined with larger models for different use cases

benjaminshafii commented 3 months ago

i think .mp4 is good for the rewind use case, so it depends which direction screenpipe wants to go.

either: a) pure piping: data goes to stdout and another unix-like tool takes over the next task, e.g. screenpipe | llm "prompt" b) storage: screenpipe --location=/some/path

i don't think you'll be able to do (a) with .mp4 (or maybe i just don't see it?), so it would force screenpipe into a (b)-type solution.

louis030195 commented 3 months ago

@ashgansh

i think .mp4 is good for the rewind use case, so it depends which direction screenpipe wants to go.

either: a) pure piping: data goes to stdout and another unix-like tool takes over the next task, e.g. screenpipe | llm "prompt" b) storage: screenpipe --location=/some/path

i don't think you'll be able to do (a) with .mp4 (or maybe i just don't see it?), so it would force screenpipe into a (b)-type solution.

curious to know why you think the unix-like pipe approach is interesting? still considering how to separate responsibilities

so it would be like:

screenpipe
# here it would stream json objects containing screenshots, text, audio, metadata, etc.
# could be used like
screenpipe | jq -r '.audio' | whisper | chatgpt "how many times did i use hedge words"
screenpipe | jq -r '.text' | chatgpt "keep a log of my day"
screenpipe | jq -r '.metadata.app' | chatgpt "maintain a markdown table of how much time i spend on apps"
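On the Rust side, such a stream could be newline-delimited JSON (one object per line) so jq can consume it directly. A minimal sketch; the `Tick` shape and its field names are purely hypothetical, and a real implementation would use serde_json rather than hand-rolled escaping:

```rust
// Hypothetical event emitted once per capture tick; field names are
// illustrative, not screenpipe's actual schema.
struct Tick<'a> {
    text: &'a str,
    app: &'a str,
    timestamp_ms: u64,
}

/// Minimal JSON escaping for the demo (backslashes and quotes only).
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// One ndjson line per tick, ready for `screenpipe | jq '.app'`.
fn to_ndjson_line(t: &Tick) -> String {
    format!(
        "{{\"text\":\"{}\",\"app\":\"{}\",\"timestamp_ms\":{}}}",
        escape(t.text),
        escape(t.app),
        t.timestamp_ms
    )
}
```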

or through SDK:

const screenPipe = new ScreenPipe();
for await (const tick of screenPipe.stream()) {
  db.from("memories").add(tick);
}

const screenPipe = new ScreenPipe();
for await (const tick of screenPipe.stream()) {
  s3.store("/memories/" + new Date()).add(tick);
}

or similar

so the screenpipe package/lib/cli/sdk would only contain the code that gathers consumer hardware info (the computer's inputs & outputs) and streams it to stdout or to the sdk

what are pros & cons?

louis030195 commented 3 months ago

@ashgansh fyi i've been reflecting on this and ended up trying to split responsibilities properly in the "audio" branch (after seeing the code was getting too messy)

idea is to have:

i'm trying to design this lib so that it's easy to extend with typescript instead of rust, because i've noticed 99.9% of programmers seem afraid of rust

ideally it would be easy to go from screenpipe to building nextjs apps

e.g. 30 s to get started with screenpipe, and at most 5 min of prod config plumbing between screenpipe and your preferred compute/storage, running 24/7 and connected to a nextjs UI

DmacMcgreg commented 2 months ago

@louis030195 you might have some luck compressing jpegs into mp4s. I'm using another open source tool and I've been able to compress over 100 GB of 4k jpegs down to about 1 GB per month, at 1 screenshot every 10 seconds. I'm not sure how you could index that one mp4 file though... that might be the tricky part.
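Indexing a single mp4 is mostly a timestamp mapping: if frames are muxed at a constant rate, frame index and playback position are interconvertible by pure arithmetic, so a sidecar index (frame number plus OCR text, say) is enough to seek. A sketch under that constant-rate assumption:

```rust
/// Playback offset (ms into the mp4 timeline) of the n-th captured frame,
/// assuming frames are muxed at a constant `fps` (hypothetical setup).
fn frame_to_mp4_ms(frame_index: u64, fps: u64) -> u64 {
    frame_index * 1000 / fps
}

/// Inverse: which captured frame is shown at a given mp4 timestamp.
fn mp4_ms_to_frame(ms: u64, fps: u64) -> u64 {
    ms * fps / 1000
}
```

e.g. with frames muxed at 1 fps, the 30th captured frame sits 30 s into the file.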

louis030195 commented 2 months ago

@DmacMcgreg we do indeed already encode all frames and audio into mp4 — that's how we only use <30 GB/month even though we're recording multiple audio streams + screens 24/7 :)
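As a sanity check on that figure: 30 GB over a 30-day month averages out to roughly 11.6 KB/s (~93 kbps) across all streams, which is plausible for inter-frame-compressed, mostly-static screen content plus compressed audio. The rough math:

```rust
/// Average sustained write rate implied by a monthly storage budget
/// (assumes a 30-day month; purely back-of-the-envelope).
fn avg_bytes_per_second(gb_per_month: f64, days: f64) -> f64 {
    gb_per_month * 1e9 / (days * 86_400.0)
}
```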