pesc / searchzonech

Search the .ch zone file for DNS Records
https://searchzone.ch/
Apache License 2.0

flag/syntax to return exact matches only #5

Open gryphius opened 3 years ago

gryphius commented 3 years ago

Thanks for searchzone.ch, it is a useful tool

Is it possible to disable similarity search and return only results that contain the search string exactly? For example, I tried "passive DNS"-style searches to see which .ch domains are hosted in certain IP ranges, but the output contains many unrelated results that merely start with similar octets.

pesc commented 3 years ago

Hi, thanks for the feedback!

Hmm, would something like this help? Or what's your exact use case?

curl --location --request POST 'https://api.searchzone.ch/api/as/v1/engines/domains-prod/search' \
--header 'authorization: Bearer search-fwyyo4i26hj5nruvauu3d372' \
--header 'Content-Type: application/json' \
--data-raw '{
    "search_fields": {
        "a_record": {}
    },
    "result_fields": {
        "domain": {
            "raw": {}
        }
    },
    "query": "151.101.1."
}'

gryphius commented 3 years ago

for example, if I wanted to search for domains which resolve to 2a02:168:2132::*:

curl --location --request POST 'https://api.searchzone.ch/api/as/v1/engines/domains-prod/search' \
--header 'authorization: Bearer search-fwyyo4i26hj5nruvauu3d372' \
--header 'Content-Type: application/json' \
--data-raw '{
    "search_fields": {
        "aaaa_record": {}
    },
    "result_fields": {
        "domain": {
            "raw": {}
        },
        "aaaa_record": {
            "raw": {}
        }
    },
    "query": "2a02:168:2132:"
}'

however, this currently also returns "similar" records, such as:

[...]
   {
      "domain": {
        "raw": "sayari.ch"
      },
      "aaaa_record": {
        "raw": [
          "2a02:168:be04::42"
        ]
      },
      "_meta": {
        "id": "sayari.ch",
        "engine": "domains-prod",
        "score": 5.4933805
      },
      "id": {
        "raw": "sayari.ch"
      }
    },
    {
      "domain": {
        "raw": "alainwolf.ch"
      },
      "aaaa_record": {
        "raw": [
          "2a02:168:f405::42"
        ]
      },
      "_meta": {
        "id": "alainwolf.ch",
        "engine": "domains-prod",
        "score": 5.4933805
      },
      "id": {
        "raw": "alainwolf.ch"
      }
    }

i.e. the aaaa record does not contain 2a02:168:2132

similarly, if I search for "picantepizza", I get tons of results which contain the word "pizza" but not necessarily "picantepizza", such as:

ristorantepizzerialafortuna.ch
ns1.hostserv.eu. info.computrade.ch. 2020101002 7200 120 2419200 10800
185.178.193.95
ns2.hostserv.eu.
ns1.hostserv.eu.
ns3.hostserv.eu.
mail.ristorantepizzerialafortuna.ch.

so, what I was hoping for is an option in the GUI/API to only return results which contain the full search string, and not perform any similarity searches.
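Until such an option exists, one possible stopgap is to post-filter the API response on the client. The sketch below is only an illustration: `filter_exact` is a hypothetical helper, and the sample data is shaped like the response excerpts quoted in this thread (a `raw` string for `domain`, a `raw` list for `aaaa_record`). It only removes false positives from a page of results that was already fetched; it cannot recover exact matches that the similarity search ranked too low to be returned.

```python
def filter_exact(results, field, needle):
    """Keep only results whose `field` contains `needle` verbatim."""
    kept = []
    for result in results:
        raw = result.get(field, {}).get("raw", [])
        # In the excerpts in this thread, `raw` is a string for `domain`
        # and a list of strings for `aaaa_record`; handle both shapes.
        values = [raw] if isinstance(raw, str) else list(raw or [])
        if any(needle in value for value in values):
            kept.append(result)
    return kept


# Sample data shaped like the response excerpts in this thread.
results = [
    {"domain": {"raw": "sayari.ch"}, "aaaa_record": {"raw": ["2a02:168:be04::42"]}},
    {"domain": {"raw": "alainwolf.ch"}, "aaaa_record": {"raw": ["2a02:168:f405::42"]}},
    {"domain": {"raw": "opteamal.ch"}, "aaaa_record": {"raw": ["2a02:168:2132::2"]}},
]

exact = filter_exact(results, "aaaa_record", "2a02:168:2132:")
print([r["domain"]["raw"] for r in exact])
```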

pesc commented 3 years ago

Alright, let me take a look at it on the weekend or in the evening. I guess it has to do with how Elasticsearch indexes this field...

pesc commented 3 years ago

I've checked it, and it seems to be a problem with how the data gets indexed in Elasticsearch. I have asked the Elasticsearch team how to solve it with the App Search I'm using under the hood. Will update if I get a solution from their side...

pesc commented 3 years ago

Sorry for the long delay. I'm quite busy with school and work. Sadly there was no progress from Elastic's side: https://discuss.elastic.co/t/precise-regex-search/266141/4

I'll try to fix and reindex the data on the weekend...

gryphius commented 3 years ago

no worries, thanks for the update!

pesc commented 3 years ago

Ok, it's a product limitation of App Search (it may be added in a future version).

Anyway, I planned to create a REST API that queries the Elasticsearch backend. Once that is implemented, it will be possible.

For example:

{
    "_source": [
        "domain$string"
    ],
    "query": {
        "prefix": {
            "aaaa_record$string": {
                "value": "2a02:168:2132:"
            }
        }
    }
}

or

{
    "_source": [
        "domain$string"
    ],
    "query": {
        "wildcard": {
            "aaaa_record$string": "2a02:168:2132:*"
        }
    }
}

Both currently return 8 matches. Does that sound plausible? 🤔
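For anyone who prefers to generate these request bodies rather than write them by hand, here is a minimal Python sketch. The helper names `prefix_query` and `wildcard_query` are made up for illustration; the field names are simply the ones from the two examples above and may differ in a later mapping.

```python
import json


def prefix_query(field, value, source=None):
    """Build an Elasticsearch `prefix` query body for a single field."""
    body = {"query": {"prefix": {field: {"value": value}}}}
    if source:
        body["_source"] = list(source)  # restrict the returned fields
    return body


def wildcard_query(field, pattern, source=None):
    """Build an Elasticsearch `wildcard` query body (`*` matches any sequence)."""
    body = {"query": {"wildcard": {field: pattern}}}
    if source:
        body["_source"] = list(source)
    return body


prefix_body = prefix_query("aaaa_record$string", "2a02:168:2132:", ["domain$string"])
wildcard_body = wildcard_query("aaaa_record$string", "2a02:168:2132:*", ["domain$string"])

print(json.dumps(prefix_body, indent=4))
print(json.dumps(wildcard_body, indent=4))
```

A `prefix` query and a `wildcard` query with a single trailing `*` should select the same documents, which fits with both variants returning the same number of matches.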

My semester ends soon, hopefully I'll find some time to continue with the project.

pesc commented 3 years ago

So, for testing purposes you can use this endpoint. The syntax is that of the Elasticsearch search API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html

Currently it isn't documented on my side - and I'm not sure whether I'll leave it like this (security, ...) - but if you need help with the syntax and fields, let me know.

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id,hits.hits._source.aaaa_record' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "prefix": {
      "aaaa_record.enum": {
        "value": "2a02:168:2132:"
      }
    }
  }
}'

Resulting in:

{
    "hits": {
        "total": {
            "value": 8
        },
        "hits": [
            {
                "_id": "opteamal.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "organicbodycare.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "organic-body-care.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "hadornag.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "host-bliss.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "chromos.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "onlineshophosting.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "websitedesign.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            }
        ]
    }
}
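
A side note on the prefix approach: it compares strings, so it relies on every address in the index being stored in the same textual form. As a hedged cross-check, results can be validated on the client with Python's standard `ipaddress` module, which compares the addresses numerically. The helper name `addresses_in_network` and the non-canonical spelling in the sample list are illustrative assumptions, not data from the index.

```python
import ipaddress


def addresses_in_network(addresses, cidr):
    """Return the addresses that numerically fall inside the network `cidr`."""
    network = ipaddress.ip_network(cidr)
    return [addr for addr in addresses if ipaddress.ip_address(addr) in network]


candidates = [
    "2a02:168:2132::2",        # from the result above
    "2a02:0168:2132:0000::2",  # same address, hypothetical non-canonical spelling
    "2a02:168:be04::42",       # earlier "similar" false positive
    "2a02:168:f405::42",       # earlier "similar" false positive
]

matches = addresses_in_network(candidates, "2a02:168:2132::/48")
print(matches)
```

The second candidate would be missed by a string prefix of `2a02:168:2132:` but is accepted by the numeric check, which is the case this sketch is meant to illustrate.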

gryphius commented 3 years ago

Works very well, thanks! Apart from the "passive dns" use case this enables other interesting searches like "give me all domains with null MX" :+1:

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id,hits.hits._source.mx_record' --header 'Content-Type: application/json' --data-raw '{
  "query": {
    "prefix": {
      "mx_record.enum": {
        "value": "."
      }
    }
  }
}'
{
  "hits" : {
    "total" : {
      "value" : 1845
    },
 [...]

No worries about the stable API - if you have to make changes/disable for security reasons that's obviously understandable.

pesc commented 3 years ago

nothing easier than this ;)

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id&size=10000' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "term": {
      "mx_valid.enum": false
    }
  }
}'

Keep in mind that Elasticsearch returns at most 10000 results per query by default; see the scroll API (https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html) if you need more results!
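
As a rough sketch of how paging with the scroll API works: the first search request carries a `scroll` keep-alive parameter, and each follow-up request sends the returned `_scroll_id` to `/_search/scroll` until a page comes back empty. In the Python below, `post(path, body)` is a hypothetical helper that would send a JSON request and return the decoded response. Here it is replaced by an in-memory fake so the example runs without network access, and whether the dev endpoint allows scrolling at all is an assumption.

```python
def scroll_hits(post, index, query, keep_alive="1m", page_size=1000):
    """Yield every hit for `query` by following the scroll protocol.

    `post(path, body)` is a hypothetical transport helper: it sends a JSON
    POST request to the cluster and returns the decoded JSON response.
    """
    page = post(f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "query": query})
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Ask for the next page using the scroll id from the previous response.
        page = post("/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": page["_scroll_id"]})


def make_fake_post(ids, page_size):
    """In-memory stand-in for a cluster that serves `ids` page by page."""
    state = {"offset": 0}

    def post(path, body):
        start = state["offset"]
        state["offset"] = start + page_size
        hits = [{"_id": doc_id} for doc_id in ids[start:start + page_size]]
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    return post


ids = [f"example{i}.ch" for i in range(5)]  # made-up document ids
query = {"term": {"mx_valid.enum": False}}
hits = list(scroll_hits(make_fake_post(ids, 2), "domains", query, page_size=2))
print([hit["_id"] for hit in hits])
```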

For each record type I have a [type]_record and a [type]_valid (true = it exists) field. My Elasticsearch mapping got a little messed up with the last upgrade; I have to review it later...

So currently I have these records:

pesc commented 3 years ago

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id,hits.hits._source.mx_record' --header 'Content-Type: application/json' --data-raw '{ "query": { "prefix": { "mx_record.enum": { "value": "." } } } }'

Oh, I may have misunderstood you - https://datatracker.ietf.org/doc/html/rfc7505 😁 but I still hope my comment above helps
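
To make the distinction concrete: RFC 7505 defines a "null MX" as a single MX record with preference 0 and "." as the exchange, which is different from a domain that has no MX record at all. Below is a minimal client-side check in Python. `is_null_mx` is a hypothetical helper, and how the index stores MX values (exchange only, or "preference exchange") is an assumption, so both forms are accepted.

```python
def is_null_mx(mx_records):
    """Return True if the MX record set is a null MX in the sense of RFC 7505."""
    if len(mx_records) != 1:
        return False  # a null MX must be the only MX record
    parts = mx_records[0].split()
    exchange = parts[-1] if parts else ""
    # Assumption: a bare "." is stored without its preference; treat it as 0.
    preference = parts[0] if len(parts) == 2 else "0"
    return exchange == "." and preference == "0"


print(is_null_mx(["."]))                 # null MX, exchange-only form
print(is_null_mx(["0 ."]))               # null MX, "preference exchange" form
print(is_null_mx(["mail.example.ch."]))  # ordinary MX record
print(is_null_mx([]))                    # no MX record at all
```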