rajasekarv / vega

A new arguably faster implementation of Apache Spark from scratch in Rust
Apache License 2.0
2.23k stars 206 forks

questions about wordcount example #127

Open Bran-Sun opened 3 years ago

Bran-Sun commented 3 years ago

I wrote a WordCount example with your framework, shown below. It only processes a 17-line text file but takes 240 s to finish on my computer. Why does it run so slowly?

use vega::io::*;
use vega::*;

fn main() -> Result<()> {
    let context = Context::new()?;

    let num_splits = 4;
    let deserializer = Fn!(|file: Vec<u8>| {
        String::from_utf8(file)
            .unwrap()
            .lines()
            .map(|s| s.to_string())
            .collect::<Vec<_>>()
    });
    let lines = context
        .read_source(LocalFsReaderConfig::new("./README.md"), deserializer)
        .flat_map(Fn!(|lines: Vec<String>| {
            Box::new(lines.into_iter()) as Box<dyn Iterator<Item = _>>
        }));

    let words = lines.flat_map(Fn!(|line: String| {
        Box::new(
            line.split(' ')
                .map(|s| (s.to_string(), 1))
                .collect::<Vec<_>>()
                .into_iter(),
        ) as Box<dyn Iterator<Item = _>>
    }));

    let result = words.reduce_by_key(Fn!(|(a, b)| a + b), num_splits);

    let output = result.collect().unwrap();

    println!("result: {:?}", output);

    Ok(())
}
rajasekarv commented 3 years ago

Hello, sorry for the very late reply. I took a break from maintaining the public branch of this library for a while, hence the delay.

240 s doesn't seem right. Can you provide more details? Maybe you are also counting the initial compilation time?
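
One way to check this might be to time only the job itself from inside the program, so that compilation is excluded from the number. The following is a minimal sketch using only the standard library. `timed` and `word_count` are hypothetical helpers, not part of vega, and the plain `HashMap` word count only stands in for the vega pipeline as a single-threaded baseline. The fallback text used when `./README.md` is missing is also an assumption.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper (not a vega API): runs `job` once and returns its
/// result together with the elapsed wall-clock time.
fn timed<T>(job: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = job();
    (out, start.elapsed())
}

/// Plain single-threaded word count, standing in for the vega pipeline.
/// Splits on single spaces, like the example in the question.
fn word_count(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.lines().flat_map(|line| line.split(' ')) {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Fall back to a tiny inline text if the file is not present.
    let text = std::fs::read_to_string("./README.md")
        .unwrap_or_else(|_| "a b a\nb c".to_string());

    // Only the closure is measured; compilation and file reading are not.
    let (counts, elapsed) = timed(|| word_count(&text));
    println!("{} distinct words in {:?}", counts.len(), elapsed);
}
```

Wrapping the vega job in the same kind of timer, and building once with `cargo build --release` before running the binary, should show whether the 240 s is spent in the job or in compilation.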