utkarshkukreti / select.rs

A Rust library to extract useful data from HTML documents, suitable for web scraping.
MIT License

Panic due to assertion failed: c.is_some() in html5ever-0.18.0/src/tokenizer/mod.rs:555:9 #54

Closed Alvenix closed 5 years ago

Alvenix commented 5 years ago

The following code causes a panic:

use std::fs::File;
use std::io::prelude::*;

#[macro_use]
extern crate error_chain;

use select::document::Document;

error_chain! {
    foreign_links {
        Reqwest(reqwest::Error);
        Std(std::io::Error);
    }
}

fn main() -> Result<()> {
    /*
     * This downloads the file
     *
     */
    // let html = reqwest::get("http://sampsonsheriff.com/")?.text()?;
    // let mut file = File::create("src/panic.txt")?;
    // file.write_all(html.as_bytes())?;

    let document = Document::from(include_str!("panic.txt"));

    Ok(())
}

Here is the stack trace:

thread 'main' panicked at 'assertion failed: c.is_some()', /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:555:9
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:70
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:58
             at src/libstd/panicking.rs:200
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:215
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:478
   5: std::panicking::begin_panic
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libstd/panicking.rs:412
   6: <html5ever::tokenizer::Tokenizer<Sink>>::discard_char
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/select-0.4.2/<::std::macros::panic macros>:3
   7: <html5ever::tokenizer::Tokenizer<Sink>>::step
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:570
   8: <html5ever::tokenizer::Tokenizer<Sink>>::run
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:362
   9: <html5ever::tokenizer::Tokenizer<Sink>>::feed
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:220
  10: <html5ever::driver::Parser<Sink> as tendril::stream::TendrilSink<tendril::fmt::UTF8>>::process
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/driver.rs:88
  11: tendril::stream::TendrilSink::one
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/tendril-0.3.1/src/stream.rs:47
  12: <select::document::Document as core::convert::From<tendril::tendril::Tendril<tendril::fmt::UTF8>>>::from
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/select-0.4.2/src/document.rs:53
  13: <select::document::Document as core::convert::From<&'a str>>::from
             at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/select-0.4.2/src/document.rs:133
  14: reqwest_test::main
             at src/main.rs:26
  15: std::rt::lang_start::{{closure}}
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libstd/rt.rs:64
  16: std::panicking::try::do_call
             at src/libstd/rt.rs:49
             at src/libstd/panicking.rs:297
  17: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:92
  18: std::rt::lang_start_internal
             at src/libstd/panicking.rs:276
             at src/libstd/panic.rs:388
             at src/libstd/rt.rs:48
  19: std::rt::lang_start
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libstd/rt.rs:64
  20: main
  21: __libc_start_main
  22: _start

I have attached the file in case the website changes: panic.txt

tatref commented 5 years ago

This is a bug in html5ever 0.18 (used by select 0.4.2). However, the repo already depends on html5ever 0.22, which fixes the issue.

So at this point, you can clone the repo and add this to your Cargo.toml:

[dependencies]
select = { path = "./select.rs" }

The second option is to wait for utkarshkukreti to release a new version.

utkarshkukreti commented 5 years ago

I have released v0.4.3 with all the dependency updates. Please let me know if you still run into this.
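
For anyone who cannot upgrade right away, one possible stopgap is to contain the panic instead of letting it take down the whole scraper. The sketch below uses only the standard library's std::panic::catch_unwind. The names parse_guarded and fragile_parse are hypothetical and are not part of select or html5ever; fragile_parse is just a stand-in for a parser that hits an assertion on some inputs.

```rust
use std::panic::{self, UnwindSafe};

/// Hypothetical helper: runs `parse` on `input` and turns a panic
/// raised inside it into `None` instead of aborting the caller.
fn parse_guarded<T, F>(input: &str, parse: F) -> Option<T>
where
    F: FnOnce(&str) -> T + UnwindSafe,
{
    // catch_unwind returns Err(..) if the closure panicked.
    panic::catch_unwind(move || parse(input)).ok()
}

/// Stand-in for a parser with an internal assertion (an assumption
/// for illustration; the real parser here would be Document::from).
fn fragile_parse(s: &str) -> usize {
    assert!(!s.contains('\0'), "assertion failed");
    s.len()
}

fn main() {
    for input in ["<p>ok</p>", "bad\0input"] {
        match parse_guarded(input, fragile_parse) {
            Some(len) => println!("parsed {} bytes", len),
            None => println!("parser panicked, skipping this page"),
        }
    }
}
```

With select, the call would presumably look like parse_guarded(html, |s| Document::from(s)), though that is untested here. Note that the default panic hook still prints the panic message to stderr, and that catch_unwind cannot catch anything when the crate is built with panic = "abort". This only hides the symptom; upgrading to a release built on the fixed html5ever remains the real fix.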