searxng / searx-space

Statistics of the public SearX(NG) instances
https://searx.space/
GNU Affero General Public License v3.0

move list of searx-instances from searx to searx-stats2 #7

Closed return42 closed 4 years ago

return42 commented 4 years ago

Starting with commit 200c3a31 from PR 1791, the list of public searx instances moved to the documentation tree. If this PR is merged, SEARX_INSTANCES_URL has to be changed to the new location.

TL;DR: everything discussed here ended up in #12 and #13.

dalf commented 4 years ago

Some thoughts about the instance list:

return42 commented 4 years ago

See https://github.com/asciimoo/searx/pull/1791#issuecomment-571137584

First, let's move the wiki entry to the docs folder; this will also close a bug at searx.

return42 commented 4 years ago

FYI: I removed the URLs from the wiki entry at searx: https://github.com/asciimoo/searx/wiki/Searx-instances

fgossel commented 4 years ago

I'm just curious. What is actually the reason for keeping the list of offline instances? Most of them will never be back online, I guess.

dalf commented 4 years ago

What is actually the reason for keeping the list of offline instances?

Right now, the list lacks maintenance (in all sections, including the online and incorrect SSL certificate sections). As long as the instance list was stored in the wiki, it was difficult to do this maintenance.

Note: It would be easier if searx-stats2 recorded the last time each instance was seen online.
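
As an illustration of the "last seen online" idea, here is a minimal sketch. It is not searx-stats2 code: the in-memory dict, the function names and the 90-day threshold are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory store: instance URL -> last time a check succeeded.
LastSeen = dict[str, datetime]


def record_check(last_seen: LastSeen, url: str, online: bool, now: datetime) -> None:
    """Remember the time of the most recent successful check for `url`."""
    if online:
        last_seen[url] = now


def stale_instances(last_seen: LastSeen, known_urls: list[str], now: datetime,
                    max_age: timedelta = timedelta(days=90)) -> list[str]:
    """Return URLs never seen online, or not seen online within `max_age`."""
    stale = []
    for url in known_urls:
        seen = last_seen.get(url)
        if seen is None or now - seen > max_age:
            stale.append(url)
    return stale
```

With such a record, the offline section could be pruned by age instead of by hand.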

dalf commented 4 years ago

How to store the instance list is linked to workflows to add / update / delete items in this list.

I guess the main workflows are:


Here are some ideas on how to store the instance list:

  1. one yaml file modified with github PR or patch sent by email. (https://github.com/asciimoo/searx/issues/1785#issuecomment-571682651)
    • possible merge conflict for the PR: an issue can be used instead of a PR (but see the last solution).
    • an order can be enforced (a post-commit script can check this).
  2. one yaml file per instance:
    • easy to keep track of the history per instance.
    • no merge conflicts, compared to the first solution.
    • https / onion URLs can be in the same .yaml file.
    • perhaps, a file name convention must be defined.
  3. a database
    • it is possible to use OAuth from different providers.
    • it requires a server rather than a few static HTML pages.
  4. one github issue per instance in a dedicated project:
    • use issue template.
    • allow a comment thread per instance.
    • a label set by the moderators for the approved instances.
    • other labels can be added by moderators (tracker, etc...)
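
The "an order can be enforced" point of the first solution could be checked by a small script. The sketch below is only an assumption about what such a check might look like (case-insensitive ascending order of the URLs); the function name is hypothetical.

```python
def unsorted_urls(urls: list[str]) -> list[tuple[str, str]]:
    """Return adjacent (previous, current) pairs that break ascending order."""
    keys = [u.lower() for u in urls]
    return [(urls[i - 1], urls[i])
            for i in range(1, len(urls))
            if keys[i - 1] > keys[i]]
```

A hook would read the URLs from the yaml file, call this function and reject the commit if the result is not empty.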

About the first solution, a yaml format:

- url: str # mandatory, https URL
  additional_urls: # optional
    - url: str # searx instance URL (example https://search.gibberfish.org/tor/ )
      relation: str # comment about the link (example _Proxied through Tor_ )
  comments: str # optional, str or a list of str (?)
  unsafe: bool # optional, see https://github.com/dalf/searx-stats2/issues/6

Example:

- url: "https://search.gibberfish.org/"
  additional_urls:
    - url: "http://o2jdk5mdsijm2b7l.onion/"
      relation: "Hidden Service"
    - url: "https://search.gibberfish.org/tor/"
      relation: "Proxied through Tor"
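
To make the proposed format concrete, here is a hedged sketch of a validator for one entry, working on already parsed data (plain dicts and lists, as a YAML parser would return them). The function name and the error messages are assumptions; the name of the key holding the extra URLs is a parameter, so the sketch does not depend on its exact spelling.

```python
def validate_entry(entry: dict, extra_key: str = "additional_urls") -> list[str]:
    """Return the problems found in one instance entry (empty list if valid)."""
    problems = []

    # Mandatory field: an https URL.
    url = entry.get("url")
    if not isinstance(url, str) or not url.startswith("https://"):
        problems.append("url: mandatory, must be an https URL")

    # Optional list of {url, relation} mappings.
    for i, extra in enumerate(entry.get(extra_key) or []):
        if not isinstance(extra, dict):
            problems.append(f"{extra_key}[{i}]: must be a mapping")
            continue
        for field in ("url", "relation"):
            if not isinstance(extra.get(field), str):
                problems.append(f"{extra_key}[{i}].{field}: must be a string")

    # Optional comments: a string or a list of strings.
    comments = entry.get("comments")
    if comments is not None and not isinstance(comments, (str, list)):
        problems.append("comments: must be a string or a list of strings")

    # Optional flag.
    if "unsafe" in entry and not isinstance(entry["unsafe"], bool):
        problems.append("unsafe: must be a boolean")

    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a request at once.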

dalf commented 4 years ago

Question: should the instance list be in this git repository or another one ?

Why another repository:

On the downside, it is another repository to manage.

return42 commented 4 years ago

Maybe I was unclear. I want to replace the lists below https://asciimoo.github.io/searx/user/public_instances.html#alive-and-running with a paragraph similar to this:

At https://searx.space you will find a list of public instances. If you want to see your searx instance added to or removed from the https://searx.space/ list, please add a comment to issue https://github.com/dalf/searx-stats2/issues/12

This way, it is up to searx-stats2 how to maintain the (internal) list; there is no need for a separately maintained list.

dalf commented 4 years ago

Note: for now, searx-stats2 scrapes the searx github repository a few times per day.

It is up to searx-stats2 how to maintain the (internal) list; there is no need for a separately maintained list.

👍

My previous comments try to address the "how to store and manage this list?" question. My wish is to make sure we all agree about the way the instance list is managed, which is why I put some answers on the table:

About the central issue: why not. Question: wouldn't it be difficult to follow the add / remove requests? Perhaps we can add a 👍 (or 👀) to the comments that have been processed (and add a notice about that).

About emails: I prefer a mailing list rather than receiving emails directly. I can create something like request at searx . space (gandi mail).

Note: a mailing list also exists: https://github.com/asciimoo/searx/issues/578

@unixfox @asciimoo > what are your views?

return42 commented 4 years ago

for now, searx-stats2 scrapes https://raw.githubusercontent.com/asciimoo/searx/master/dpublic_instances.rst

Really? What is SEARX_INSTANCES_URL needed for? (Sorry if the question is dumb, I haven't looked through all the sources.)

Lighter / simpler: a yaml file in searx-stats2 ... People send a PR to update the file.

is what I vote for

Question: wouldn't it be difficult to follow the add / remove requests?

Adding a link to the commit message should be enough to track.

mailing list

is dead

Do not try to make it perfect from the beginning: 80/20 rule

Most often it is better to establish a simple workflow initially. When you see it fail in some respects in practical use, you can fix or optimize the workflow with the experience gained from practice.

rather than receiving emails directly

That's OK, adding an issue comment should be enough to start (BTW I modified #12 that way).

dalf commented 4 years ago

Really? What is SEARX_INSTANCES_URL needed for? (Sorry if the question is dumb, I haven't looked through all the sources.)

https://github.com/dalf/searx-stats2/blob/master/searxstats/source/searx_docs.py#L6 I haven't deleted the previous code.

Do not try to make it perfect from the beginning: 80/20 rule

Sure, but:

I'm okay with #12 solution.

BTW, I've created #13

unixfox commented 4 years ago

I think it would be better to have a dedicated issue template rather than a general issue, because:

dalf commented 4 years ago

@unixfox > makes sense.

In this case, the issues about the instances and the one about the code will merge in one big list. I think it will be confusing ?

Labels can be a way to solve this :

Another way is to create an additional github repository. The user rights can be different between this project and the new one.

return42 commented 4 years ago

The repository and the commits do matter, github dependencies only reduce the degrees of freedom.

@dalf you are the master of searx-stats2 and the decision is up to you. I can only repeat myself: let's keep things simple and make progress.

dalf commented 4 years ago

Why was the instance list hosted by the wiki a problem? As I understand it, anyone could modify the content, and in particular delete an instance without notice.

The solution here is to add a human review:

How to review a delete request? Should the request to add the instance and the request to delete the instance come from the same github account? If it comes from a different account, I don't know how to deal with it.


Here is a solution:

When a reviewer accepts the change, the instance list is modified with a commit (no need for a PR): reviewers are trusted to write good commit messages.

The drawbacks:

@return42 : it is basically what you have suggested, except that there is one issue per request instead of one long list. I think it makes the reviewer's life easier.

unixfox commented 4 years ago

I thought about something like letsencrypt: an HTTP challenge to add the searx instance, a deny entry in the robots.txt to delete the instance, all managed automatically by a setting in settings.yml, but that's a heavy solution

Why not instead a TXT entry in the DNS?

dalf commented 4 years ago

Why not instead a TXT entry in the DNS?

With the HTTP challenge / robots.txt solution, searx code can deal with it automatically:

The DNS solution requires another layer of complexity: most probably it requires a "check my DNS configuration" step in searx-stats2.

Anyway, both can be implemented, but each requires a database and a web server.

Are you saying that you prefer this solution to the ".yaml file + github issues" solution ?
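
For illustration, the "deny entry in the robots.txt" half of the automatic approach could look like the sketch below. It uses the standard library `urllib.robotparser` on robots.txt text that has already been downloaded; the user agent name `searx-stats2` and the function name are assumptions, and nothing like this is claimed to exist in searx or searx-stats2.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent name for the statistics crawler.
CRAWLER_AGENT = "searx-stats2"


def opted_out(robots_txt: str, instance_url: str, agent: str = CRAWLER_AGENT) -> bool:
    """True if robots.txt forbids `agent` from fetching the instance URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, instance_url)
```

An instance answering with `User-agent: searx-stats2` followed by `Disallow: /` would then be treated as a delete request, with no human in the loop.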

dalf commented 4 years ago

So here is a proposal:

usage: update.py [-h]
                 [--github-issues [GITHUB_ISSUE_LIST [GITHUB_ISSUE_LIST ...]]]
                 [--add [ADD_INSTANCES [ADD_INSTANCES ...]]]
                 [--delete [DELETE_INSTANCES [DELETE_INSTANCES ...]]]
                 [--edit [EDIT_INSTANCES [EDIT_INSTANCES ...]]]

Update the instance list according to the github issues.

optional arguments:
  -h, --help            show this help message and exit
  --github-issues [GITHUB_ISSUE_LIST [GITHUB_ISSUE_LIST ...]]
                        Github issue number to process, by default all
  --add [ADD_INSTANCES [ADD_INSTANCES ...]]
                        Add instance(s)
  --delete [DELETE_INSTANCES [DELETE_INSTANCES ...]]
                        Delete instance(s)
  --edit [EDIT_INSTANCES [EDIT_INSTANCES ...]]
                        Edit instance(s)
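
The help text above is consistent with an `argparse` definition along the following lines. This is a reconstruction from the usage output, not the actual source of update.py; in particular the `dest` names are inferred from the metavars, and `type=int` for the issue numbers is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --help matches the usage text shown above."""
    parser = argparse.ArgumentParser(
        prog="update.py",
        description="Update the instance list according to the github issues.")
    # nargs="*" yields the "[X [X ...]]" form: the flag may come with no value.
    parser.add_argument("--github-issues", dest="github_issue_list", nargs="*",
                        type=int,  # assumption: issue numbers are integers
                        help="Github issue number to process, by default all")
    parser.add_argument("--add", dest="add_instances", nargs="*",
                        help="Add instance(s)")
    parser.add_argument("--delete", dest="delete_instances", nargs="*",
                        help="Delete instance(s)")
    parser.add_argument("--edit", dest="edit_instances", nargs="*",
                        help="Edit instance(s)")
    return parser
```

With `nargs="*"`, an option that is absent parses to `None` while an option given without values parses to an empty list, which is one way to tell "process all issues" from "no issues requested".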

The tool :

--github-issues reads the github issues.

There are issue templates : https://github.com/dalf/searx-instances/issues/new/choose

So:

An example of what is shown in the default editor:

https://nibblehole.com:
  safe: false

# Add https://nibblehole.com
#
# Close https://github.com/dalf/searx-instances/issues/2
# From @dalf

#> The above text is the commit message
#> Delete the whole buffer to cancel the request

#> -- MESSAGE -----------------------
#> See https://github.com/asciimoo/searx/pull/1818

Here it is possible to modify the yaml and the commit message, then validate, or delete the whole buffer to cancel.
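
One plausible way to interpret such a buffer after the editor closes is sketched below. The rules are inferred from the example only (lines starting with `#>` are hints, other `#` lines form the commit message, the rest is yaml, an empty buffer cancels); the function name is hypothetical and the real helper may behave differently.

```python
from typing import Optional, Tuple


def parse_buffer(text: str) -> Optional[Tuple[str, str]]:
    """Split an edited buffer into (yaml_text, commit_message), or None to cancel."""
    if not text.strip():
        return None  # the whole buffer was deleted: cancel the request

    yaml_lines, message_lines = [], []
    for line in text.splitlines():
        if line.startswith("#>"):
            continue  # helper hints, never kept
        if line.startswith("#"):
            # Drop the comment marker and one optional space after it.
            message_lines.append(line[2:] if line.startswith("# ") else line[1:])
        else:
            yaml_lines.append(line)
    return "\n".join(yaml_lines).strip(), "\n".join(message_lines).strip()
```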

Note: this tool is not mandatory, it is only a helper.


searx-stats integration: pip install does not update a package referenced from a git repository. So here is PR #16, which basically runs git clone https://github.com/dalf/searx-instances/ or git pull on each run, and makes an ugly change to the PYTHONPATH.
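
The clone-or-pull step can be sketched as follows. This is not the code of PR #16: the function names are hypothetical, and the sketch only builds the git command and adjusts `sys.path`, leaving the actual execution to the caller.

```python
import os
import sys

REPO_URL = "https://github.com/dalf/searx-instances/"


def sync_command(checkout_dir: str, repo_url: str = REPO_URL) -> list:
    """Return the git command that brings `checkout_dir` up to date."""
    if os.path.isdir(os.path.join(checkout_dir, ".git")):
        # Existing checkout: update it in place.
        return ["git", "-C", checkout_dir, "pull"]
    # First run: clone the repository.
    return ["git", "clone", repo_url, checkout_dir]


def make_importable(checkout_dir: str) -> None:
    """Put the checkout first on the import path (the "ugly" PYTHONPATH change)."""
    if checkout_dir not in sys.path:
        sys.path.insert(0, checkout_dir)
```

A caller would pass the result to `subprocess.run(cmd, check=True)` at the start of each run and then call `make_importable` before importing the package.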

dalf commented 4 years ago

The PR #16 has been merged. The instance list is hosted here: https://github.com/dalf/searx-instances/

You can see the result at https://searx.space/. In the top right corner, the Show comments checkbox displays something similar to https://asciimoo.github.io/searx/user/public_instances.html, with the exception of the "Useful information" and the "Meta-searx instances" sections.

return42 commented 4 years ago

@dalf excellent work, much more than I ever expected / thanks a lot!!