sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

suggestions for mastiff #2254

Open ctb opened 2 years ago

ctb commented 2 years ago

A few thoughts on mastiff/SRA-search-as-a-service moving forward -

note the "we" here is really @luizirber :), although the pressure to learn Rust continues to increase.

luizirber commented 2 years ago

> we are likely to want to provide multiple databases / versions of databases down the road - in particular, real-time search of GTDB and Genbank genomes would be ideal. would be good if mastiff can allow for this now, before we start getting regular users.

I can version the API call (/v1/search instead of /search), and more parameters can be passed in (like which DB/version to search). I should have done that for mastiff, but time crunch :upside_down_face: (wort already does that, and has its API described in the OpenAPI format).
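A minimal sketch of how a client might build such a versioned request, assuming a /v1/search path and query parameters for the database and its version. The function name, the `db` and `db_version` parameter names, and the example.org host are all hypothetical; nothing here is the actual mastiff API.

```python
from urllib.parse import urlencode


def search_url(base, api_version=1, database=None, database_version=None):
    """Build a versioned search URL (hypothetical layout, not the mastiff API)."""
    # Version goes in the path, e.g. /v1/search rather than /search
    path = f"{base.rstrip('/')}/v{api_version}/search"

    # Optional selectors for which database (and which release of it) to search;
    # the parameter names "db" and "db_version" are assumptions for illustration
    params = {}
    if database is not None:
        params["db"] = database
    if database_version is not None:
        params["db_version"] = database_version

    return f"{path}?{urlencode(params)}" if params else path


print(search_url("https://example.org"))
print(search_url("https://example.org", database="gtdb", database_version="v2"))
```

Putting the version in the path keeps old clients working against /v1 while a later /v2 changes parameters or output.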

> would be good to track heavy users and Web sites - not 100% sure how to do this. perhaps if we require API keys but make it easy to get them, then that would be good? otherwise we're going to be stuck figuring things out from referrer logs, and/or banning people who decide to overwhelm the server

I was thinking about using a rate limiter in Caddy, and adding logic to the mastiff CLI client to deal with 429 (Too Many Requests) responses.
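A rough sketch of the client-side half of that idea, in Python rather than in the mastiff client itself. It assumes the server answers with 429 and may send a Retry-After header in seconds; `request_with_backoff` and the `send` callable are hypothetical names, and `send` stands in for whatever actually performs the HTTP request.

```python
import time


def request_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send() is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt

        # Prefer the server's Retry-After hint when it is a number of seconds;
        # otherwise (missing, or an HTTP-date) fall back to exponential backoff
        try:
            delay = float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            delay = base_delay * 2 ** attempt
        sleep(max(0.0, delay))

    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` in as a parameter makes the retry logic testable without waiting, and a hard cap on attempts keeps a misbehaving script from hammering the server indefinitely.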

For monitoring I use datadog for wort because it is easy to deploy.

API keys are a good idea, but they involve storing more info about the user (creating accounts and so on), which complicates the service quite a bit. So I would avoid that for now =]

ctb commented 2 years ago

> API keys are a good idea, but they involve storing more info about the user (creating accounts and so on), which complicates the service quite a bit. So I would avoid that for now =]

agree, but on the contrapositive -

maybe support or require a contact e-mail (or something) somewhere so that we can backtrack from logs?

I'm only suggesting this all because I've seen what happens when people run services that become popular :)

while I'm suggesting random features - would be cool to support a manifest-style output. although I think the current format does a fine job as a picklist so maybe we don't need it.

ctb commented 2 years ago

oh, and a request for further reporting information - could we get the number of overlapping hashes in addition to the containment? :)
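For reference, the two numbers are closely related: containment of the query in a match is the overlap count divided by the number of query hashes. A small sketch under that definition, using plain Python sets of integer hashes; `overlap_and_containment` is a hypothetical helper, not part of mastiff or sourmash.

```python
def overlap_and_containment(query_hashes, match_hashes):
    """Return (number of shared hashes, containment of the query in the match)."""
    query = set(query_hashes)
    overlap = len(query & set(match_hashes))
    # Containment = |query ∩ match| / |query|; define it as 0.0 for an empty query
    containment = overlap / len(query) if query else 0.0
    return overlap, containment


print(overlap_and_containment({1, 2, 3, 4}, {3, 4, 5}))  # (2, 0.5)
```

Reporting the raw overlap alongside the ratio helps distinguish a high containment backed by thousands of hashes from one backed by only a handful.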