louis030195 / screen-pipe

Turn your screen into actions (using LLMs). Inspired by adept.ai, rewind.ai, Apple Shortcuts. Rust + WASM.
https://screenpi.pe
MIT License
81 stars 1 fork

Should we compress screenshots? #8

Open ashgansh opened 4 days ago

ashgansh commented 4 days ago

tl;dr

I started experimenting with downscaling images. If the purpose is to pipe content to an LLM, I think it would be beneficial to reduce the number of input tokens we send (the LLM doesn't care whether an image looks good, so we can make some interesting tradeoffs). At the moment I don't think this implementation is the way to go, but I'm sharing it here to facilitate future work.

More Info

Screenpipe generates screenshots of roughly 10 MB each.

I tried modifying the code to add compression (see full source below).

The problem is that it puts a lot of strain on the CPU, and it's really slow: around 2-3 s per screenshot. There are probably ways to optimize it, though.

(screenshot: per-stage timing output, 2024-06-27 15:24:55)

I ran some prototypes with 2x downscaling + JPEG conversion in screenpipe.

Naive implementation:

use chrono::Local;
use clap::Parser;
use crossbeam::channel;
use image::{ImageBuffer, ImageEncoder, DynamicImage, imageops::FilterType, ColorType};
use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Cursor, Write};
use std::path::Path;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, sleep};
use std::time::{Duration, Instant};
use xcap::Monitor;
use rayon::prelude::*; // Add rayon for parallel processing

const DISPLAY: &str = r"
      ___         ___         ___         ___         ___         ___                   ___                 ___      ___     
     /  /\       /  /\       /  /\       /  /\       /  /\       /__/\                 /  /\    ___        /  /\    /  /\    
    /  /:/_     /  /:/      /  /::\     /  /:/_     /  /:/_      \  \:\               /  /::\  /  /\      /  /::\  /  /:/_   
   /  /:/ /\   /  /:/      /  /:/\:\   /  /:/ /\   /  /:/ /\      \  \:\             /  /:/\:\/  /:/     /  /:/\:\/  /:/ /\  
  /  /:/ /::\ /  /:/  ___ /  /:/~/:/  /  /:/ /:/_ /  /:/ /:/_ _____\__\:\           /  /:/~/:/__/::\    /  /:/~/:/  /:/ /:/_ 
 /__/:/ /:/\:/__/:/  /  //__/:/ /:/__/__/:/ /:/ //__/:/ /:/ //__/::::::::\         /__/:/ /:/\__\/\:\__/__/:/ /:/__/:/ /:/ /\
 \  \:\/:/~/:\  \:\ /  /:\  \:\/:::::\  \:\/:/ /:\  \:\/:/ /:\  \:\~~\~~\/         \  \:\/:/    \  \:\/\  \  \:\/:/\  \:\/:/ /:/
  \  \::/ /:/ \  \:\  /:/ \  \::/~~~~ \  \::/ /:/ \  \::/ /:/ \  \:\  ~~~           \  \::/      \__\::/\  \::/  \  \::/ /:/ 
   \__\/ /:/   \  \:\/:/   \  \:\      \  \:\/:/   \  \:\/:/   \  \:\                \  \:\      /__/:/  \  \:\   \  \:\/:/  
     /__/:/     \  \::/     \  \:\      \  \::/     \  \::/     \  \:\                \  \:\     \__\/    \  \:\   \  \:\/:/  
     \__\/       \__\/       \__\/       \__\/       \__\/       \__\/                 \__\/               \__\/    \__\/    

";

#[derive(Parser)]
#[command(name = "screenpipe")]
#[command(about = "A tool to capture screenshots at regular intervals", long_about = None)]
struct Cli {
    /// Path to save screenshots
    #[arg(short, long, default_value = "target/screenshots")]
    path: String,

    /// Interval in seconds between screenshots (can be float, by default no delay)
    #[arg(short, long, default_value_t = 0.0)]
    interval: f32,

    /// Downscale factor (e.g., 2 means half the original size)
    #[arg(short, long, default_value_t = 1)]
    downscale: u32,

    /// Convert to grayscale
    #[arg(short, long)]
    grayscale: bool,

    /// Compress the output image
    #[arg(short, long)]
    compress: bool,
}

fn normalized(filename: &str) -> String {
    filename.replace("|", "")
           .replace("\\", "")
           .replace(":", "")
           .replace("/", "")
}

fn process_image(
    monitor: &Monitor,
    downscale: u32,
    frame_count: u32,
    sub_dir: &str,
    compress: bool,
) -> (Vec<u8>, String) {
    // Start timing the entire process
    let total_start = Instant::now();

    // Capture the image from the monitor
    let capture_start = Instant::now();
    let xcap_image = monitor.capture_image().unwrap();
    let capture_duration = capture_start.elapsed();
    println!("Image capture took: {:?}", capture_duration);

    // Get image dimensions (width()/height() already return u32)
    let width = xcap_image.width();
    let height = xcap_image.height();

    // Convert raw image data to DynamicImage
    let conversion_start = Instant::now();
    let mut image: DynamicImage = ImageBuffer::from_raw(width, height, xcap_image.into_raw())
        .map(DynamicImage::ImageRgba8)
        .unwrap();
    let conversion_duration = conversion_start.elapsed();
    println!("Image conversion took: {:?}", conversion_duration);

    // Downscale the image
    let downscale = downscale.max(1);
    let new_width = width / downscale;
    let new_height = height / downscale;
    let resize_start = Instant::now();
    image = image.resize_exact(new_width, new_height, FilterType::Nearest);
    let resize_duration = resize_start.elapsed();
    println!("Image resize took: {:?}", resize_duration);

    // Convert to RGB
    let rgb_conversion_start = Instant::now();
    let rgb_image = image.to_rgb8();
    let rgb_conversion_duration = rgb_conversion_start.elapsed();
    println!("RGB conversion took: {:?}", rgb_conversion_duration);

    // Compress the image
    let compress_start = Instant::now();
    let mut jpg_data = Vec::new();
    let mut cursor = Cursor::new(&mut jpg_data);
    let quality = if compress { 70 } else { 100 };
    image::codecs::jpeg::JpegEncoder::new_with_quality(&mut cursor, quality)
        .write_image(
            rgb_image.as_raw(),
            new_width,
            new_height,
            ColorType::Rgb8.into(),
        )
        .unwrap();
    let compress_duration = compress_start.elapsed();
    println!("Image compression took: {:?}", compress_duration);

    // Generate the filename
    let filename = format!(
        "{}/monitor-{}-{}.jpg",
        sub_dir,
        normalized(monitor.name()),
        frame_count
    );

    // Total duration
    let total_duration = total_start.elapsed();
    println!("Total image processing took: {:?}", total_duration);

    (jpg_data, filename)
}

fn screenpipe(cli: &Cli, running: Arc<AtomicBool>) {
    if !Path::new(&cli.path).exists() {
        create_dir_all(&cli.path).unwrap();
    }

    let monitors = Monitor::all().unwrap();
    let mut frame_count = 0;

    println!("Found {} monitors", monitors.len());
    println!("Screenshots will be saved to {}", cli.path);
    println!("Interval: {} seconds", cli.interval);
    println!("Press Ctrl+C to stop");
    println!("{}", DISPLAY);

    let (tx, rx) = channel::bounded::<(Vec<u8>, String)>(monitors.len() * 2);

    let save_thread = thread::spawn(move || {
        while let Ok((image_data, filename)) = rx.recv() {
            let file = File::create(&filename).unwrap();
            let mut writer = BufWriter::new(file);
            writer.write_all(&image_data).unwrap();
        }
    });

    while running.load(Ordering::Relaxed) {
        let start_time = Instant::now();

        let day_dir = format!("{}/{}", cli.path, Local::now().format("%Y-%m-%d"));
        create_dir_all(&day_dir).unwrap();

        let sub_dir = format!("{}/{}", day_dir, frame_count / 60);
        create_dir_all(&sub_dir).unwrap();

        monitors.par_iter().for_each(|monitor| {
            let (image_data, filename) = process_image(
                monitor,
                cli.downscale,
                frame_count,
                &sub_dir,
                cli.compress,
            );
            tx.send((image_data, filename)).unwrap();
        });

        println!("Captured screens. Frame: {}", frame_count);

        let elapsed = start_time.elapsed();
        if elapsed < Duration::from_secs_f32(cli.interval) {
            sleep(Duration::from_secs_f32(cli.interval) - elapsed);
        }

        frame_count += 1;
    }

    drop(tx);
    save_thread.join().unwrap();
}

fn main() {
    let cli = Cli::parse();
    let running = Arc::new(AtomicBool::new(true));
    let r = running.clone();

    ctrlc::set_handler(move || {
        r.store(false, Ordering::Relaxed);
    })
    .expect("Error setting Ctrl-C handler");

    screenpipe(&cli, running);
}
louis030195 commented 4 days ago

according to claude:

Example: Screen Recording 24/7 in Different Formats

Assumptions

MP4 (Video)

JPEG (Image)

PNG (Image)

Summary

Recommendation

For continuous screen recording, MP4 is the most efficient in terms of storage, balancing quality and file size.

louis030195 commented 4 days ago

https://github.com/nashaofu/xcap/issues/137

louis030195 commented 4 days ago

rewind:

(image)
louis030195 commented 4 days ago

my idea is to make screenpi.pe efficient both:

in terms of storage

and in terms of compute

say, post-processing or compression could also be done in the cloud to save local compute, at the cost of added network load

louis030195 commented 4 days ago

@ashgansh

I started experimenting with downscaling images. If the purpose is to pipe content to LLM i think it would be beneficial to reduce the amount of input tokens we would send. (the llm doesn't care if an image looks good so we should be able to make some interesting tradeoffs). At the moment I feel that this implementation is not the way to go, but thought I might share it here to facilitate future work on this.

yeah, i think so: right now, running llama3 100% locally 24/7, my mac catches fire

probably, again, a hybrid approach: smaller models alongside larger models for different use cases

ashgansh commented 4 days ago

i think .mp4 is good for the rewind use case, so it depends which direction screenpipe wants to go.

either: a) pure piping: data goes to stdout and another unix-like tool takes over the next task, e.g. screenpipe | llm "prompt", or b) storage: screenpipe --location=/some/path

i don't think you'll be able to do a) with .mp4 (or maybe i just don't see it?), so it would force screenpipe into a b)-type solution.

louis030195 commented 2 days ago

@ashgansh

i think .mp4 is good for the rewind use case, so it depends which direction screenpipe wants to go.

either: a) pure piping: data goes to stdout and another unix-like tool takes over the next task, e.g. screenpipe | llm "prompt", or b) storage: screenpipe --location=/some/path

i don't think you'll be able to do a) with .mp4 (or maybe i just don't see it?), so it would force screenpipe into a b)-type solution.

curious to know why you think the unix-like pipe approach is interesting? still considering how to separate responsibilities

so it would be like:

screenpipe
# here it would stream json objects containing screenshots, text, audio, metadata, etc.
# could be used like:
screenpipe | jq -r '.audio' | whisper | chatgpt "how many times did i use hedge words"
screenpipe | jq -r '.text' | chatgpt "keep a log of my day"
screenpipe | jq -r '.metadata.app' | chatgpt "maintain a markdown table of how much time i spend in apps"

or through SDK:

const screenPipe = new ScreenPipe();
for await (const tick of screenPipe.stream()) {
  db.from("memories").add(tick);
}

// or stream straight to object storage:
const screenPipe = new ScreenPipe();
for await (const tick of screenPipe.stream()) {
  s3.store("/memories/" + new Date()).add(tick);
}

or similar

so the screenpipe package/lib/cli/sdk would only contain the code that gathers consumer hardware info (the computer's inputs & outputs) and streams it to stdout or the SDK

what are the pros & cons?