sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Webserver umap: zero-size array to reduction operation maximum which has no identity #107

Open konstin opened 3 years ago

konstin commented 3 years ago

When I tried to run the webserver's SeqVec pipeline with the seqwence-protein example, I got an exception in the UMAP stage:

[2020-12-30 15:52:28,294: INFO/MainProcess] Received task: webserver.tasks.pipeline.run_pipeline[0388a330f2c34715a07c0a56c7c5fe14]  
[2020-12-30 15:52:29,786: INFO/MainProcess] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[2020-12-30 15:52:29,892: INFO/MainProcess] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[2020-12-30 15:52:29,894: INFO/MainProcess] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[2020-12-30 15:52:30,015: INFO/MainProcess] instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
[2020-12-30 15:52:30,015: INFO/MainProcess] instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
[2020-12-30 15:52:30,016: INFO/MainProcess] instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
[2020-12-30 15:52:30,016: INFO/MainProcess] instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
[2020-12-30 15:52:30,362: INFO/MainProcess] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[2020-12-30 15:52:31,545: INFO/MainProcess] ------ Starting pipeline execution...
[2020-12-30 15:52:31,545: INFO/MainProcess] Created the prefix directory /tmp/tmp81rnzfl4/bio_embeddings_job
[2020-12-30 15:52:31,545: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/input_parameters_file.yml
[2020-12-30 15:52:31,553: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/sequences_file.fasta
[2020-12-30 15:52:31,554: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/mapping_file.csv
[2020-12-30 15:52:31,554: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/remapped_sequences_file.fasta
[2020-12-30 15:52:31,555: INFO/MainProcess] Created the stage directory /tmp/tmp81rnzfl4/bio_embeddings_job/seqvec_embeddings
[2020-12-30 15:52:31,556: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/seqvec_embeddings/input_parameters_file.yml
[2020-12-30 15:52:31,558: INFO/MainProcess] CUDA NOT available, using the CPU. This is slow
[2020-12-30 15:52:31,558: INFO/MainProcess] Initializing ELMo.
[2020-12-30 15:52:36,126: INFO/MainProcess] Running ELMo warmup
[2020-12-30 15:52:44,194: INFO/MainProcess] The minimum expected size for the reduced_embedding_file is 0.008MB.
[2020-12-30 15:52:44,194: INFO/MainProcess] The minimum expected size for the embedding_file is 0.184MB.
[2020-12-30 15:52:44,194: INFO/MainProcess] You are going to generate a total of 0.193MB of embeddings, and have 236428.358MB available at /tmp/tmp81rnzfl4/bio_embeddings_job.
[2020-12-30 15:52:44,195: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/seqvec_embeddings/embeddings_file.h5
[2020-12-30 15:52:44,196: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/seqvec_embeddings/reduced_embeddings_file.h5
[2020-12-30 15:52:44,197: WARNING/MainProcess] 0%|          | 0/2 [00:00<?, ?it/s]
[2020-12-30 15:52:44,463: WARNING/MainProcess] 50%|#####     | 1/2 [00:00<00:00,  3.76it/s]
[2020-12-30 15:52:44,464: WARNING/MainProcess] 50%|#####     | 1/2 [00:00<00:00,  3.74it/s]
[2020-12-30 15:52:44,465: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/seqvec_embeddings/ouput_parameters_file.yml
[2020-12-30 15:52:44,468: INFO/MainProcess] Copying embeddings_file to database.
[2020-12-30 15:52:44,472: INFO/MainProcess] Copying reduced_embeddings_file to database.
[2020-12-30 15:52:44,475: INFO/MainProcess] Copying mapping_file to database.
[2020-12-30 15:52:44,477: INFO/MainProcess] Created the stage directory /tmp/tmp81rnzfl4/bio_embeddings_job/annotations_from_seqvec
[2020-12-30 15:52:44,477: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/annotations_from_seqvec/input_parameters_file.yml
[2020-12-30 15:52:44,506: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/annotations_from_seqvec/DSSP3_predictions_file.fasta
[2020-12-30 15:52:44,507: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/annotations_from_seqvec/DSSP8_predictions_file.fasta
[2020-12-30 15:52:44,507: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/annotations_from_seqvec/disorder_predictions_file.fasta
[2020-12-30 15:52:44,507: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/annotations_from_seqvec/per_sequence_predictions_file.csv
[2020-12-30 15:52:44,517: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/annotations_from_seqvec/ouput_parameters_file.yml
[2020-12-30 15:52:44,521: INFO/MainProcess] Copying embeddings_file to database.
[2020-12-30 15:52:44,524: INFO/MainProcess] Copying reduced_embeddings_file to database.
[2020-12-30 15:52:44,526: INFO/MainProcess] Copying DSSP3_predictions_file to database.
[2020-12-30 15:52:44,527: INFO/MainProcess] Copying DSSP8_predictions_file to database.
[2020-12-30 15:52:44,529: INFO/MainProcess] Copying disorder_predictions_file to database.
[2020-12-30 15:52:44,531: INFO/MainProcess] Copying per_sequence_predictions_file to database.
[2020-12-30 15:52:44,533: INFO/MainProcess] Copying mapping_file to database.
[2020-12-30 15:52:44,535: INFO/MainProcess] Created the stage directory /tmp/tmp81rnzfl4/bio_embeddings_job/umap_projections
[2020-12-30 15:52:44,535: INFO/MainProcess] Created the file /tmp/tmp81rnzfl4/bio_embeddings_job/umap_projections/input_parameters_file.yml
[2020-12-30 15:52:44,541: WARNING/MainProcess] UMAP(a=None, angular_rp_forest=True, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='cosine',
     metric_kwds=None, min_dist=0.6, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=420, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, transform_seed=42, unique=False, verbose=1)
[2020-12-30 15:52:44,542: WARNING/MainProcess] /home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/umap/umap_.py:1678: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[2020-12-30 15:52:44,542: WARNING/MainProcess] Construct fuzzy simplicial set
[2020-12-30 15:52:44,734: WARNING/MainProcess] Wed Dec 30 15:52:44 2020
[2020-12-30 15:52:44,734: WARNING/MainProcess] Finding Nearest Neighbors
[2020-12-30 15:52:46,975: WARNING/MainProcess] Wed Dec 30 15:52:46 2020
[2020-12-30 15:52:46,975: WARNING/MainProcess] Finished Nearest Neighbor Search
[2020-12-30 15:52:48,770: WARNING/MainProcess] Wed Dec 30 15:52:48 2020
[2020-12-30 15:52:48,770: WARNING/MainProcess] Construct embedding
[2020-12-30 15:52:48,793: ERROR/MainProcess] Task webserver.tasks.pipeline.run_pipeline[0388a330f2c34715a07c0a56c7c5fe14] raised unexpected: ValueError('zero-size array to reduction operation maximum which has no identity')
Traceback (most recent call last):
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/sentry_sdk/integrations/celery.py", line 197, in _inner
    reraise(*exc_info)
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/sentry_sdk/_compat.py", line 54, in reraise
    raise value
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/sentry_sdk/integrations/celery.py", line 192, in _inner
    return f(*args, **kwargs)
  File "/home/konsti/bio_embeddings/webserver/tasks/pipeline.py", line 86, in run_pipeline
    execute_pipeline_from_config(config, post_stage=_post_stage_save)
  File "/home/konsti/bio_embeddings/bio_embeddings/utilities/pipeline.py", line 200, in execute_pipeline_from_config
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/home/konsti/bio_embeddings/bio_embeddings/project/pipeline.py", line 125, in run
    return PROTOCOLS[kwargs["protocol"]](**kwargs)
  File "/home/konsti/bio_embeddings/bio_embeddings/project/pipeline.py", line 75, in umap
    projected_embeddings = umap_reduce(reduced_embeddings, **kwargs)
  File "/home/konsti/bio_embeddings/bio_embeddings/project/umap.py", line 16, in umap_reduce
    transformed_embeddings = UMAP(**umap_params).fit_transform(embeddings)
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/umap/umap_.py", line 2014, in fit_transform
    self.fit(X, y)
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/umap/umap_.py", line 1965, in fit
    self.embedding_ = simplicial_set_embedding(
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/umap/umap_.py", line 1024, in simplicial_set_embedding
    graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
  File "/home/konsti/bio_embeddings/.venv/lib/python3.8/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
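
A possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: the progress bar shows only 2 sequences, so the `n_neighbors is larger than the dataset size` warning truncates `n_neighbors` from 15 to `X.shape[0] - 1 = 1`. With a single neighbour the fuzzy graph may end up with no edges, which would leave `graph.data` empty and make `graph.data.max()` fail as shown in the traceback. The sketch below mirrors that truncation and rejects inputs that are too small before UMAP is called. The helper names and the `min_neighbors=2` threshold are hypothetical and are not part of bio_embeddings or umap-learn.

```python
def effective_n_neighbors(n_samples: int, n_neighbors: int = 15) -> int:
    """Mirror the truncation announced in the UserWarning:
    n_neighbors is capped at n_samples - 1."""
    return min(n_neighbors, n_samples - 1)


def check_projectable(n_samples: int, n_neighbors: int = 15,
                      min_neighbors: int = 2) -> None:
    """Raise a readable error when the input is too small to project.

    min_neighbors=2 is an assumed threshold, not a documented limit.
    """
    k = effective_n_neighbors(n_samples, n_neighbors)
    if k < min_neighbors:
        raise ValueError(
            f"Cannot project {n_samples} sequence(s): n_neighbors would be "
            f"truncated to {k}, but at least {min_neighbors} are needed."
        )


# The two-sequence example from the log: 15 is truncated to 1.
print(effective_n_neighbors(2))
```

Under these assumptions, calling `check_projectable(len(embeddings))` just before `UMAP(**umap_params).fit_transform(embeddings)` in `umap_reduce` would turn the opaque NumPy error into a message that names the actual cause.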

sacdallago commented 3 years ago

Hmm, have to look this one up. Could you send the sequence file?

konstin commented 3 years ago

https://github.com/sacdallago/bio_embeddings/blob/develop/test-data/seqwence-protein.fasta

98jxy commented 2 years ago

Hi, have you solved this problem?