typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

No option to create non-nested attributes #56

Open attila-csaszar opened 6 months ago

attila-csaszar commented 6 months ago

Description

The scraper automatically creates fields with nested names, such as hierarchy.lvl0, hierarchy.lvl1, etc. I use the index in React Instantsearch on a Docusaurus site, where highlighting does not work with nested attributes. I currently use a cumbersome workaround: I export all the documents from the generated Typesense collection, do a search-and-replace to remove the nesting, then import them into a collection with a custom schema without nested attributes. I have also tried simply defining the attribute names in custom field descriptions, but this does not help: the scraper leaves the non-nested attribute fields empty and still produces the same nested attributes.
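
The search-and-replace step of this workaround could be scripted roughly as follows. This is only a sketch: flattenHierarchy and flattenJsonl are hypothetical helper names, and it assumes the export is JSONL (one JSON document per line, which is what Typesense's export endpoint returns) and that the target collection expects top-level lvl0, lvl1, ... fields.

```javascript
// Hypothetical helper: moves hierarchy levels to top-level keys (lvl0, lvl1, ...).
// Handles both flat "hierarchy.lvlN" keys and a nested "hierarchy" object;
// when both are present, the flat key's value wins.
function flattenHierarchy(doc) {
  const flat = {};
  for (const [key, value] of Object.entries(doc)) {
    if (key === "hierarchy" && value !== null && typeof value === "object") {
      for (const [level, text] of Object.entries(value)) {
        if (!(level in flat)) flat[level] = text;
      }
    } else if (key.startsWith("hierarchy.")) {
      flat[key.slice("hierarchy.".length)] = value;
    } else {
      flat[key] = value;
    }
  }
  return flat;
}

// Hypothetical helper: applies flattenHierarchy to every line of a JSONL export
// and returns JSONL again, ready to import into the non-nested collection.
function flattenJsonl(jsonl) {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.stringify(flattenHierarchy(JSON.parse(line))))
    .join("\n");
}
```

Null levels are passed through unchanged, so the custom schema would need to declare those fields as optional.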

Expected Behavior

It would be nice to have the option to create an index without any nested attributes. Thanks in advance!

Metadata

Typesense Version: 0.25.1

OS: macOS

jasonbosco commented 6 months ago

> I use the index in React Instantsearch on a Docusaurus site, where highlighting does not work with nested attributes.

That sounds odd. Could you elaborate on this?

This site, for example, uses the Docusaurus Typesense theme, and highlighting seems to work fine in its search results: https://orkes.io/content

attila-csaszar commented 6 months ago

Honestly, I'm not sure why. I've built a custom search page within my Docusaurus site using React Instantsearch. It's nice to work with React on Docusaurus, because the whole search page works out of a single JSX file. I'll paste the file content here:

import React from 'react';
import Layout from '@theme/Layout';
import {
  InstantSearch,
  Configure,
  InfiniteHits,
  SearchBox,
  ClearRefinements,
  Highlight,
  RefinementList,
  Stats
} from 'react-instantsearch-hooks-web';
import TypesenseInstantSearchAdapter from "typesense-instantsearch-adapter";
import './advanced-search.css';

const typesenseInstantsearchAdapter = new TypesenseInstantSearchAdapter({
  server: {
    apiKey: "*********************", // Be sure to use an API key that only allows search operations
    nodes: [
      {
        host: "3k7iy4toh0we8z5lp.a1.typesense.net",
        port: "443",
        protocol: "https",
      },
    ],
    cacheSearchResultsForSeconds: 0, // Cache search results from server. Defaults to 2 minutes. Set to 0 to disable caching.
  },
  // The following parameters are directly passed to Typesense's search API endpoint.
  //  So you can pass any parameters supported by the search endpoint below.
  //  query_by is required.
  additionalSearchParameters: {
    query_by: "lvl1, lvl0, lvl2, lvl3, lvl4, lvl5, content",
    numTypos: 1,
    exhaustive_search: true,
    max_facet_values: 30,
    drop_tokens_threshold: 0,
  },
});
const searchClient = typesenseInstantsearchAdapter.searchClient;

function App() {
  return (
    <Layout>
      <div>
        <div className="container">
          <InstantSearch
            searchClient={searchClient}
            indexName="nevisdocslive"
          >
            <Configure hitsPerPage={8} />
            <div className="search-panel">
              <div className="search-panel__filters">
                <div className='refinements-panel'>
                  <span className='ais-Panel-header'>Product or component</span>
                  <RefinementList
                    attribute="lvl0"
                    searchable
                    showMore
                    limit={8}
                    facetOrdering
                    searchablePlaceholder= 'Search for products'
                  />
                </div>
                <div className='refinements-panel'>
                  <span className='ais-Panel-header'>Version</span>
                  <RefinementList
                    attribute="version"
                    searchable
                    showMore
                    limit={8}
                    facetOrdering
                    searchablePlaceholder= 'Search for versions'
                  />
                </div>
                <ClearRefinements />
              </div>

              <div className="search-panel__results">
                <div className="search-box">
                  <SearchBox
                    className="searchbox"
                    placeholder=""
                  />
                  <div className="ais-Panel-header">
                    <Stats />
                  </div>
                </div>
                <InfiniteHits
                  hitComponent={Hit}
                  showPrevious={false}
                />
              </div>
            </div>
          </InstantSearch>
        </div>
      </div>
    </Layout>
  );
}

const Hit = ({ hit }) => {
  const urlText = hit.url;

  return (
    <div>
      <a className="hit_title_link" href={urlText}>
        <Highlight hit={hit} attribute="lvl1" />
        <div className="heading_attributes">
          <Highlight hit={hit} attribute="lvl2" /> {'>'} <Highlight hit={hit} attribute="lvl3" />
        </div>
      </a>
      <div className="hit_product_tag">
        <b><Highlight hit={hit} attribute="lvl0" /> </b>
        <Highlight hit={hit} attribute="version" />
      </div>
      <div className="hit_content">
        <Highlight hit={hit} attribute="lvl4" />
      </div>
      <div className="hit_content_code">
        <Highlight hit={hit} attribute="lvl5" />
      </div>
      <div className="hit_content">
        <Highlight hit={hit} attribute="content" />
      </div>
    </div>
  );
};

export default App;

This is the version that works correctly now, i.e. with no nested attributes. Unfortunately, I was unable to find any real explanation or helpful threads on why nested attributes break the highlighting function, but they do.

jasonbosco commented 6 months ago

> The scraper automatically creates fields with nested names, such as hierarchy.lvl0, hierarchy.lvl1

This was actually the behavior in an older version of the scraper. Could you upgrade to the latest version of the scraper (typesense/docsearch-scraper:0.9.1) and run it against your documentation site? Your documents should then have a top-level key called hierarchy, with the levels as sub-fields inside a nested object, like this example:

{
  "anchor": "query-suggestions",
  "hierarchy": {
    "lvl0": "Query Suggestions",
    "lvl1": null,
    "lvl2": null,
    "lvl3": null,
    "lvl4": null,
    "lvl5": null,
    "lvl6": null
  },
  "hierarchy.lvl0": "Query Suggestions",
  "url": "https://typesense.org/docs/guide/query-suggestions.html#query-suggestions"
}

If that also doesn't work, I would recommend using the hit object directly instead of using the <Highlight /> component.
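
As a rough illustration of reading the hit object directly, the sketch below resolves a dotted attribute such as hierarchy.lvl0 against a hit. The names resolvePath and hitText are hypothetical, and the shape of hit._highlightResult (an entry per attribute with a value string holding the highlighted markup) follows the usual InstantSearch convention; whether the adapter populates it for nested fields is an assumption worth checking against a real hit in the browser console.

```javascript
// Hypothetical helper: resolves "hierarchy.lvl0" against an object, trying the
// literal dotted key first and then walking nested objects part by part.
function resolvePath(obj, path) {
  if (obj == null) return undefined;
  if (Object.prototype.hasOwnProperty.call(obj, path)) return obj[path];
  return path.split(".").reduce(
    (node, part) => (node != null && typeof node === "object" ? node[part] : undefined),
    obj
  );
}

// Hypothetical helper: returns the highlighted markup when the hit carries one
// (assumed shape: { value: "..." }), otherwise the plain field value,
// otherwise an empty string.
function hitText(hit, attribute) {
  const highlighted = resolvePath(hit._highlightResult, attribute);
  if (highlighted && typeof highlighted.value === "string") return highlighted.value;
  const plain = resolvePath(hit, attribute);
  return plain == null ? "" : String(plain);
}
```

Inside the Hit component, a call such as hitText(hit, "hierarchy.lvl1") could then stand in for the corresponding Highlight element. Note that the highlighted value is markup while the fallback is plain text, so rendering the result as HTML is only reasonable when the indexed content is trusted.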