laysakura / trie-rs

Memory efficient trie (prefix tree) library based on LOUDS
https://crates.io/crates/trie-rs
Apache License 2.0
90 stars 10 forks source link

Updated version of pull request #25 #28

Closed LucaCappelletti94 closed 4 months ago

LucaCappelletti94 commented 4 months ago

Updated version of pull request #25

Once merged, this pull request resolves the issues #23 and #24

shanecelis commented 4 months ago

Will mem_dbg let us compare the size of the dictionary versus the size of the trie like this issue aims to?

LucaCappelletti94 commented 4 months ago

Yes, it should easily be able to do that. If you scroll down in the readme of this other project, you will find mem_dbg examples comparing trie, a simple Vec and an RCL. Next up I will add an RCL compressed with Huffman.

shanecelis commented 4 months ago

Hi Luca,

This mem_dbg stuff is cool. I wrote a little exploratory code to get a feel for it that I'd like to turn into a test.

    #[cfg(feature = "mem_dbg")]
    #[test]
    fn memsize() {
        use std::{env, io::{BufReader, BufRead}, fs::{File}};
        use mem_dbg::*;

        let mut builder = TrieBuilder::new();

        let repo_root = env::var("CARGO_MANIFEST_DIR").expect("CARGO_MANIFEST_DIR environment variable must be set.");
        let edict2_path = format!("{}/benches/edict.furigana", repo_root);
        println!("Reading dictionary file from: {}", edict2_path);

        let mut n_words = 0;
        for result in BufReader::new(File::open(edict2_path).unwrap()).lines() {
            let l = result.unwrap();
            builder.push(l);
            n_words += 1;
        }
        println!("Read {} words.", n_words);

        let trie = builder.build();
        let size = trie.mem_size(SizeFlags::default());
        eprintln!("Trie size {size}");
        let uncompressed: Vec<String> = trie.predictive_search("").collect();
        eprintln!("Uncompressed size {}", uncompressed.mem_size(SizeFlags::default()));

    }

The file itself is 3,717,355 bytes. When I run the above

$ cargo t --features mem_dbg memsize -- --nocapture
...
Read 185536 words.
Trie size 4174393
Uncompressed size 7984707

I'm surprised the trie is bigger than the dictionary file itself. I would have expected it to be smaller. Fortunately, it is half the size of the "uncompressed" version of the contents.

Collecting the uncompressed version takes a while, so it should probably not be kept as a unit test. Maybe it could be a benchmark somehow.

shanecelis commented 4 months ago

I'm intent on merging this. Can you update the Changelog?

LucaCappelletti94 commented 4 months ago

Sure, on it.

LucaCappelletti94 commented 4 months ago

Should I remove the travis badge? It is no longer working since travis became a paid services.

LucaCappelletti94 commented 4 months ago

Updated changelog - feel free to publish. I would suggest to remove the travis stuff before.

shanecelis commented 4 months ago

Thanks for the PR.