neon-mmd / websurfx

:rocket: An open source alternative to searx which provides a modern-looking :sparkles:, lightning-fast :zap:, privacy respecting :disguised_face:, secure :lock: meta search engine
https://github.com/neon-mmd/websurfx/tree/rolling/docs
GNU Affero General Public License v3.0
722 stars, 94 forks

👽️ `Duckduckgo` engine code according to the new `html` changes #432

Closed neon-mmd closed 10 months ago

neon-mmd commented 10 months ago

Description

Fix the DuckDuckGo upstream engine code by updating it to match the recent API/HTML changes.

To apply the fix, update the `duckduckgo.rs` file in the `src/engines/` directory of the codebase (the `websurfx` directory) with the changes shown below:

//! The `duckduckgo` module handles the scraping of results from the duckduckgo search engine
//! by querying the upstream duckduckgo search engine with user provided query and with a page
//! number if provided.

use std::collections::HashMap;

use reqwest::header::HeaderMap;
use reqwest::Client;
use scraper::Html;

use crate::models::aggregation_models::SearchResult;

use crate::models::engine_models::{EngineError, SearchEngine};

use error_stack::{Report, Result, ResultExt};

use super::search_result_parser::SearchResultParser;

/// A new DuckDuckGo engine type defined in order to implement the `SearchEngine` trait, which
/// reduces code duplication and makes it easy to create a vector of different search engines.
pub struct DuckDuckGo {
    /// The parser, used to interpret the search result.
    parser: SearchResultParser,
}

impl DuckDuckGo {
    /// Creates the DuckDuckGo parser.
    pub fn new() -> Result<Self, EngineError> {
        Ok(Self {
            parser: SearchResultParser::new(
                ".no-results",
-                ".result",
-                ".result__a",
+                ".results>.result",
+                ".result__title>.result__a",
                ".result__url",
                ".result__snippet",
            )?,
        })
    }
}

#[async_trait::async_trait]
impl SearchEngine for DuckDuckGo {
    async fn results(
        &self,
        query: &str,
        page: u32,
        user_agent: &str,
        client: &Client,
        _safe_search: u8,
    ) -> Result<HashMap<String, SearchResult>, EngineError> {
        // The page number can be missing or an empty string, so appropriate handling is required
        // so that the upstream server receives a valid page number.
        let url: String = match page {
            1 | 0 => {
                format!("https://html.duckduckgo.com/html/?q={query}&s=&dc=&v=1&o=json&api=/d.js")
            }
            _ => {
                format!(
                    "https://duckduckgo.com/html/?q={}&s={}&dc={}&v=1&o=json&api=/d.js",
                    query,
                    (page / 2 + (page % 2)) * 30,
                    (page / 2 + (page % 2)) * 30 + 1
                )
            }
        };

        // Initialize the HeaderMap with the appropriate headers. The keys are the HTTP header
        // names (e.g. `User-Agent`), not the Rust constant names (e.g. `USER_AGENT`).
        let header_map = HeaderMap::try_from(&HashMap::from([
            ("User-Agent".to_string(), user_agent.to_string()),
            ("Referer".to_string(), "https://google.com/".to_string()),
            (
                "Content-Type".to_string(),
                "application/x-www-form-urlencoded".to_string(),
            ),
            ("Cookie".to_string(), "kl=wt-wt".to_string()),
        ]))
        .change_context(EngineError::UnexpectedError)?;

        let document: Html = Html::parse_document(
            &DuckDuckGo::fetch_html_from_upstream(self, &url, header_map, client).await?,
        );

        if self.parser.parse_for_no_results(&document).next().is_some() {
            return Err(Report::new(EngineError::EmptyResultSet));
        }

        // scrape all the results from the html
        self.parser
            .parse_for_results(&document, |title, url, desc| {
                Some(SearchResult::new(
                    title.inner_html().trim(),
                    &format!("https://{}", url.inner_html().trim()),
                    desc.inner_html().trim(),
                    &["duckduckgo"],
                ))
            })
    }
}
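
The pagination arithmetic in `results` above is easy to misread, so here is a minimal standalone sketch of it. The helper name `ddg_offsets` is hypothetical and not part of the websurfx codebase; it only mirrors the `s` and `dc` expressions from the code above so the values can be checked by hand.

```rust
/// Hypothetical helper mirroring the `s`/`dc` arithmetic from `results` above.
/// Returns `None` for pages 0 and 1, where the URL leaves `s` and `dc` empty.
fn ddg_offsets(page: u32) -> Option<(u32, u32)> {
    match page {
        0 | 1 => None,
        _ => {
            // `page / 2 + page % 2` is the integer ceiling of page / 2;
            // it is then scaled by 30, as in the original expression.
            let s = (page / 2 + page % 2) * 30;
            Some((s, s + 1))
        }
    }
}

fn main() {
    for page in 0..=5 {
        match ddg_offsets(page) {
            None => println!("page {page}: s=, dc="),
            Some((s, dc)) => println!("page {page}: s={s}, dc={dc}"),
        }
    }
}
```

With this formula, page 2 gives `s=30, dc=31`, while pages 3 and 4 both give `s=60, dc=61`, so consecutive odd/even pages share an offset. Whether that is intended is not stated in this issue.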

[!Note]

  1. To get started contributing, make sure to read the contributing.md file for the guidelines on how to contribute to this project.
  2. To contribute, first fork this project (follow this video tutorial if you are not familiar with the process), add your changes, and open a pull request against this repository. If you are new to GitHub, follow this video tutorial to get started contributing :slightly_smiling_face:.

Screenshots

No response

Do you want to work on this issue?

Yes

Additional information

No response

github-actions[bot] commented 10 months ago

The issue has been unlocked and is now ready for dev. If you would like to work on this issue, you can comment to have it assigned to you. You can learn more in our contributing guide https://github.com/neon-mmd/websurfx/blob/rolling/CONTRIBUTING.md