sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Caching, ma' old friend #102

Closed sacdallago closed 3 years ago

sacdallago commented 3 years ago

Currently, some sequences aren't processed by the webserver even though they are below the sequence length limit (2k), because mongo refuses to store their embeddings (this is especially the case with seqvec, whose embeddings are Lx1024x3):

62.216.202.215 - - [14/Dec/2020:21:44:26 +0000] "POST /api/annotations HTTP/1.0" 500 37 "https://api.bioembeddings.com/api/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.10; rv:84.0) Gecko/20100101 Firefox/84.0"
[2020-12-14 21:44:48,511] ERROR in app: Exception on /api/annotations [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.8/site-packages/flask_restx/api.py", line 375, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/flask/views.py", line 89, in view
    return self.dispatch_request(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/flask_restx/resource.py", line 44, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/app/webserver/endpoints/annotations.py", line 147, in post
    return _get_annotations_from_params(params)
  File "/app/webserver/endpoints/annotations.py", line 30, in _get_annotations_from_params
    annotations = get_features(model_name, sequence)
  File "/app/webserver/endpoints/task_interface.py", line 70, in get_features
    embeddings = get_embedding(model_name, sequence)
  File "/app/webserver/endpoints/task_interface.py", line 43, in get_embedding
    get_embedding_cache.insert_one(
  File "/usr/local/lib/python3.8/site-packages/pymongo/collection.py", line 698, in insert_one
    self._insert(document,
  File "/usr/local/lib/python3.8/site-packages/pymongo/collection.py", line 613, in _insert
    return self._insert_one(
  File "/usr/local/lib/python3.8/site-packages/pymongo/collection.py", line 602, in _insert_one
    self.__database.client._retryable_write(
  File "/usr/local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1498, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/usr/local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
    return self._retry_internal(retryable, func, session, bulk)
  File "/usr/local/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1416, in _retry_internal
    return func(session, sock_info, retryable)
  File "/usr/local/lib/python3.8/site-packages/pymongo/collection.py", line 590, in _insert_command
    result = sock_info.command(
  File "/usr/local/lib/python3.8/site-packages/pymongo/pool.py", line 699, in command
    self._raise_connection_failure(error)
  File "/usr/local/lib/python3.8/site-packages/pymongo/pool.py", line 683, in command
    return command(self, dbname, spec, slave_ok,
  File "/usr/local/lib/python3.8/site-packages/pymongo/network.py", line 135, in command
    message._raise_document_too_large(
  File "/usr/local/lib/python3.8/site-packages/pymongo/message.py", line 1085, in _raise_document_too_large
    raise DocumentTooLarge("BSON document too large (%d bytes)"
pymongo.errors.DocumentTooLarge: BSON document too large (19121192 bytes) - the connected server supports BSON document sizes up to 16793598 bytes.
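
To see why sequences below the 2k limit can still hit this error, here is a rough size estimate. It is only a sketch: the limit is the value reported in the error message above and the shape is the Lx1024x3 seqvec shape, but the bytes per value (4 for float32, 8 for float64) and the function names are assumptions, and BSON overhead is ignored.

```python
# Back-of-the-envelope size of a per-residue embedding cached as one document.
# Assumption: values are stored as raw 4- or 8-byte floats; BSON overhead is ignored.

MAX_BSON_BYTES = 16_793_598  # limit reported in the DocumentTooLarge message


def embedding_bytes(length: int, dims: int = 1024, layers: int = 3,
                    bytes_per_value: int = 8) -> int:
    """Raw payload size of an L x dims x layers embedding."""
    return length * dims * layers * bytes_per_value


def max_length_that_fits(limit: int = MAX_BSON_BYTES, **kwargs) -> int:
    """Longest sequence whose raw embedding stays within the limit."""
    per_residue = embedding_bytes(1, **kwargs)
    return limit // per_residue


if __name__ == "__main__":
    for bytes_per_value in (4, 8):
        longest = max_length_that_fits(bytes_per_value=bytes_per_value)
        print(f"{bytes_per_value} bytes/value: up to {longest} residues")
```

Under either assumption the longest sequence that fits in a single document is well below 2k residues, which would be consistent with the failure above.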

Researching this issue, I found this answer: https://stackoverflow.com/a/4667728

Especially worrying, to me, is the idea that when the cache is queried, it's entirely copied to RAM (did I get that right? Is that really so?!). If so, we should definitely move away from storing embeddings as BSON documents and move to GridFS storage instead. Should be straightforward, just StreamIO the data.
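
A minimal sketch of what a GridFS-backed cache could look like. This is not the webserver's current code: `store_embedding`, `load_embedding` and the key scheme are hypothetical names, and `fs` is assumed to be a `gridfs.GridFS` instance (only its `put`, `find_one` and `read` calls are used). The embedding is assumed to be serialized to bytes already.

```python
import io
from typing import Optional


def store_embedding(fs, key: str, payload: bytes) -> None:
    """Store payload under `key` unless an entry with that name already exists."""
    if fs.find_one({"filename": key}) is None:
        # GridFS splits the stream into chunks, so the per-document
        # size limit does not apply to the payload as a whole.
        fs.put(io.BytesIO(payload), filename=key)


def load_embedding(fs, key: str) -> Optional[bytes]:
    """Return the cached bytes for `key`, or None on a cache miss."""
    grid_out = fs.find_one({"filename": key})
    if grid_out is None:
        return None
    # read() loads this one file into memory, not the whole cache.
    return grid_out.read()
```

With pymongo installed, `fs` could be created with `gridfs.GridFS(MongoClient()["some_db"])` (the database name is a placeholder), and the payload could be produced by calling `numpy.save` on an `io.BytesIO` buffer and read back with `numpy.load`.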

sacdallago commented 3 years ago

Note to self: data to reproduce

POST /annotations on swagger (https://api.bioembeddings.com/api/) with:

{"model":"seqvec","sequence":"MDKFWWHTAWGLCLLQLSLAHQQIDLNVTCRYAGVFHVEKNGRYSISRTEAADLCQAFNSTLPTMDQMKLALSKGFETCRYGFIEGNVVIPRIHPNAICAANHTGVYILVTSNTSHYDTYCFNASAPPEEDCTSVTDLPNSFDGPVTITIVNRDGTRYSKKGEYRTHQEDIDASNIIDDDVSSGSTIEKSTPEGYILHTYLPTEQPTGDQDDSFFIRSTLATIASTVHSKSHAAAQKQNNWIWSWFGNSQSTTQTQEPTTSATTALMTTPETPPKRQEAQNWFSWLFQPSESKSHLHTTTKMPGTESNTNPTGWEPNEENEDETDTYPSFSGSGIDDDEDFISSTIATTPRVSARTEDNQDWTQWKPNHSNPEVLLQTTTRMADIDRISTSAHGENWTPEPQPPFNNHEYQDEEETPHATSTTPNSTAEAAATQQETWFQNGWQGKNPPTPSEDSHVTEGTTASAHNNHPSQRITTQSQEDVSWTDFFDPISHPMGQGHQTESKDTDSSHSTTLQPTAAPNTHLVEDLNRTGPLSVTTPQSHSQNFSTLHGEPEEDENYPTTSILPSSTKSSAKDARRGGSLPTDTTTSVEGYTFQYPDTMENGTLFPVTPAKTEVFGETEVTLATDSNVNVDGSLPGDRDSSKDSRGSSRTVTHGSELAGHSSANQDSGVTTTSGPMRRPQIPEWLIILASLLALALILAVCIAVNSRRRCGQKKKLVINGGNGTVEDRKPSELNGEASKSQEMVHLVNKEPSETPDQCMTADETRNLQSVDMKIGV","format":"full"}

On webserver: docker logs --follow bio_embeddings_webserver

sacdallago commented 3 years ago

Related: