rust-lang / hashbrown

Rust port of Google's SwissTable hash map
https://rust-lang.github.io/hashbrown
Apache License 2.0

Memory leak #548

Closed DimitriTimoz closed 2 months ago

DimitriTimoz commented 2 months ago

I get a random segfault with the following code. I ran it under valgrind and it reported the memory leak below.

rustc 1.80.1 (3f5fd8dd4 2024-08-06), on Ubuntu

Code

extern crate spider;
extern crate env_logger;

use std::hash::{DefaultHasher, Hash, Hasher};

use newspaper::{Newspaper, NewspaperModel, Paper};
use spider::website::Website;
use spider::tokio;
use env_logger::Env;

use meilisearch_sdk::client::*;

pub mod newspaper;

async fn indexing(papers: &[Paper]) {
    println!("Indexing {} papers", papers.len());
    let client = Client::new("http://localhost:7700", Some("a")).unwrap();

    let res = client.index("papers").add_documents(papers, Some("hash_url")).await;
    match res {
        Ok(_) => println!("Indexing done"),
        Err(e) => println!("Error: {}", e),
    }
}

#[tokio::main]
async fn main() {
    let env = Env::default()
    .filter_or("RUST_LOG", "error")
    .write_style_or("RUST_LOG_STYLE", "always");

    let file = std::fs::File::open("newspapers.json").unwrap();
    let newspapers: Vec<NewspaperModel> = serde_json::from_reader(file).unwrap();

    let newspapers = newspapers.iter().map(|newspaper| Newspaper::from(newspaper.clone())).collect::<Vec<_>>();
    let mut hasher: DefaultHasher = DefaultHasher::new();
    // TODO: Use a thread pool to scrape multiple websites concurrently
    for paper in newspapers {
        let mut website = Website::new(paper.get_url().as_str());
        website.with_limit(100);
        println!("Scraping {}", paper.get_title());
        website.scrape().await;
        println!("Scraping done");
        let mut papers = Vec::new();
        if let Some(pages) = website.get_pages() {
            println!("Scraping pages");
            for page in pages.as_ref() {
                let html = page.get_html();

                let document = scraper::Html::parse_document(&html);
                for selector in paper.get_selectors() {
                    let texts = document.select(selector).flat_map(|el| el.text()).collect::<Vec<_>>();
                    if texts.is_empty() {
                        continue;
                    }
                    page.get_url().hash(&mut hasher);
                    papers.push(Paper {
                        title: page.get_url().to_string(),
                        url: page.get_url().to_string(),
                        content: texts.join(" "),
                        hash_url: hasher.finish()
                    });
                    break;
                }

                if papers.len() >= 500 {
                    indexing(papers.as_slice()).await;
                    papers.clear();
                }
            }
        }
        if papers.is_empty() {
            continue;
        }
        indexing(papers.as_slice()).await;
    }
}

Valgrind

Indexing done
Error leaked 816 B in 1 block
        Info at malloc
             at alloc::alloc::alloc (alloc.rs:100)
             at alloc_impl (global.rs:35)
             at allocate (global.rs:100)
             at hashbrown::raw::inner::alloc::inner::do_alloc (alloc.rs:36)
             at hashbrown::raw::inner::RawTableInner::new_uninitialized (mod.rs:1750)
             at hashbrown::raw::inner::RawTableInner::fallible_with_capacity (mod.rs:1788)
             at hashbrown::raw::inner::RawTableInner::with_capacity (mod.rs:1815)
             at hashbrown::raw::inner::RawTable<T,A>::with_capacity_in (mod.rs:901)
             at hashbrown::raw::inner::RawTable<T>::with_capacity (mod.rs:836)
             at hashbrown::map::HashMap<K,V,S>::with_capacity_and_hasher (map.rs:505)
             at hashbrown::map::HashMap<K,V>::with_capacity (map.rs:324)
             at hashbrown::set::HashSet<T>::with_capacity (set.rs:190)
             at __static_ref_initialize (page.rs:68)
             at core::ops::function::FnOnce::call_once (function.rs:250)
     Summary Leaked 816 B total

Cargo tree

crawler v0.1.0 (/home/dimitri/Development/fact-cheker/crawler)
├── meilisearch-sdk v0.27.1
│   ├── reqwest v0.12.7
│   │   ├── h2 v0.4.6
│   │   │   ├── indexmap v2.4.0
│   │   │   │   └── hashbrown v0.14.5
├── spider v2.1.2
│   ├── hashbrown v0.14.5 (*)

Here is the page.rs that could be the cause of this leak: here

Amanieu commented 2 months ago

That's normal: it's how lazy_static is supposed to work. It allocates a HashSet and keeps it for the entire lifetime of the process. There is no need to free it at the end, since the process is exiting anyway.
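
For illustration, here is a minimal sketch of that pattern (hypothetical names and capacity, not the actual spider page.rs source), assuming a static set declared with the lazy_static crate:

// Hypothetical sketch: lazy_static initializes the set on first access and
// never drops it, so valgrind reports the allocation as still held when the
// process exits.
use std::collections::HashSet;

use lazy_static::lazy_static;

lazy_static! {
    // Allocated lazily on first use; lives for the rest of the process.
    static ref VISITED: HashSet<String> = HashSet::with_capacity(16);
}

fn main() {
    // The first access runs __static_ref_initialize, which calls
    // HashSet::with_capacity -- the frames seen in the valgrind trace above.
    println!("visited {} urls", VISITED.len());
    // The set is intentionally never freed; the OS reclaims the memory when
    // the process exits, so this is not a real leak.
}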

Closing since this isn't a bug in hashbrown.