nestauk / old_nesta_daps

[archived]
MIT License

[293] Transform arXiv abstracts to vectors and apply soft clustering #295

Closed kstathou closed 4 years ago

kstathou commented 4 years ago

A Metaflow pipeline that fetches arXiv articles from a SQL database, transforms their abstracts into vectors with Sentence Transformers, and clusters them with HDBSCAN. You can either fit a new HDBSCAN model or use an existing one.

It produces three outputs:

I opted for:

You can run the pipeline as follows (the parameters are documented in doc2cluster.py):

  1. Run the pipeline and fit a new HDBSCAN model:
     python doc2cluster.py --no-pylint run --transformer distilbert-base-nli-stsb-mean-tokens --db_config mysqldb.config --min_cluster_size 10 --min_samples 2 --new_clusterer True --clusterer_name hdbscan_model --s3_bucket <MY_BUCKET_NAME>
  2. Run the pipeline and predict a cluster distribution for vectors using a fitted HDBSCAN model:
     python doc2cluster.py --no-pylint run --transformer distilbert-base-nli-stsb-mean-tokens --db_config mysqldb.config --new_clusterer False --clusterer_name hdbscan_model --s3_bucket <MY_BUCKET_NAME>
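
The two invocations differ only in the value of --new_clusterer and in the HDBSCAN hyperparameters, which appear only when a new model is fitted. As a rough sketch, a small wrapper could assemble either command. Note that `build_command` and its arguments are hypothetical helpers for illustration, not part of doc2cluster.py; the flag names and default values are copied from the commands above.

```python
import shlex


def build_command(bucket, new_clusterer, clusterer_name="hdbscan_model",
                  min_cluster_size=10, min_samples=2):
    """Assemble the doc2cluster.py command line as a list of arguments.

    Hypothetical helper: flags mirror the two commands shown in this thread.
    """
    cmd = [
        "python", "doc2cluster.py", "--no-pylint", "run",
        "--transformer", "distilbert-base-nli-stsb-mean-tokens",
        "--db_config", "mysqldb.config",
    ]
    if new_clusterer:
        # Hyperparameters are only passed when fitting a new HDBSCAN model.
        cmd += ["--min_cluster_size", str(min_cluster_size),
                "--min_samples", str(min_samples)]
    cmd += ["--new_clusterer", str(new_clusterer),
            "--clusterer_name", clusterer_name,
            "--s3_bucket", bucket]
    return cmd


if __name__ == "__main__":
    # Print the "fit a new model" command for a placeholder bucket name.
    print(shlex.join(build_command("my-bucket", new_clusterer=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting issues with the bucket name.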

I also added the .metaflow/ directory to .gitignore.

jaklinger commented 4 years ago

Hi @kstathou, thanks for that. It all looks good to me, with just some minor stylistic points so far. I'll do a test run once you've implemented those fixes and the Travis tests are passing. Thanks!

kstathou commented 4 years ago

Thanks for the comments @jaklinger! I made the changes; let me know if anything else is needed.

jaklinger commented 4 years ago

Great, thanks for that! I've just put in a couple of changes to get it running on MySQL.

It's running now, but I'll leave the pipeline running to see how long it takes.

jaklinger commented 4 years ago

@kstathou I get the following error. Do you know how I should fix it?

2020-07-31 17:54:13.063 [1596210356438338/transform/2 (pid 68363)] WARNING:root:You try to use a model that was created with version 0.3.2, however, your version is 0.3.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)] 
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)] 
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)] 
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)] <flow Doc2ClusterFlow step transform> failed:
2020-07-31 17:54:13.071 [1596210356438338/transform/2 (pid 68363)] Internal error
2020-07-31 17:54:13.071 [1596210356438338/transform/2 (pid 68363)] Traceback (most recent call last):
2020-07-31 17:54:14.373 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/cli.py", line 857, in main
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] start(auto_envvar_prefix='METAFLOW', obj=state)
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 764, in __call__
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] return self.main(*args, **kwargs)
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 717, in main
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] rv = self.invoke(ctx)
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 956, in invoke
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return ctx.invoke(self.callback, **ctx.params)
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 555, in invoke
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return callback(*args, **kwargs)
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/decorators.py", line 27, in new_func
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return f(get_current_context().obj, *args, **kwargs)
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/cli.py", line 432, in step
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] max_user_code_retries)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/task.py", line 393, in run_step
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] self._exec_step_function(step_func)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/task.py", line 47, in _exec_step_function
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] step_function()
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "doc2cluster.py", line 156, in transform
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] self.embeddings = model.encode(self.abstracts)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 124, in encode
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] sentences_tokenized = [self.tokenize(sen) for sen in sentences]
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 124, in <listcomp>
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] sentences_tokenized = [self.tokenize(sen) for sen in sentences]
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 184, in tokenize
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] return self._first_module().tokenize(text)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/models/Transformer.py", line 48, in tokenize
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] return self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(text))
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 282, in tokenize
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/re.py", line 194, in sub
2020-07-31 17:54:14.401 [1596210356438338/transform/2 (pid 68363)] return _compile(pattern, flags).sub(repl, string, count)
2020-07-31 17:54:14.401 [1596210356438338/transform/2 (pid 68363)] TypeError: expected string or bytes-like object
2020-07-31 17:54:14.401 [1596210356438338/transform/2 (pid 68363)] 
2020-07-31 17:54:14.404 [1596210356438338/transform/2 (pid 68363)] Task failed.
2020-07-31 17:54:14.404 Workflow failed.
2020-07-31 17:54:14.404 Terminating 0 active tasks...
2020-07-31 17:54:14.404 Flushing logs...
    Step failure:
    Step transform (task-id 2) failed.

kstathou commented 4 years ago

@jaklinger I cannot access the database (I guess my IP changed :( ). My guess is that some papers have missing abstracts, so I am now filtering them out in the start step. Can you check whether it works now? I also bumped sentence-transformers to 0.3.2 to remove the warning at the top of your error message.
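
For context, "TypeError: expected string or bytes-like object" is what `re.sub` raises when it receives a non-string such as None, which is consistent with the missing-abstract guess. Below is a minimal sketch of this kind of filter, assuming each article is a dict with `id` and `abstract` keys. This schema and the helper name `drop_missing_abstracts` are assumptions for illustration; the actual query and field names in doc2cluster.py may differ.

```python
def drop_missing_abstracts(articles):
    """Keep only articles whose abstract is a non-empty string.

    Hypothetical helper: rows with None (NULL in the database) or
    whitespace-only abstracts are dropped before encoding.
    """
    return [
        a for a in articles
        if isinstance(a.get("abstract"), str) and a["abstract"].strip()
    ]


# Toy rows standing in for the result of the SQL query (assumed schema).
articles = [
    {"id": "a1", "abstract": "We study soft clustering of documents."},
    {"id": "a2", "abstract": None},   # missing abstract
    {"id": "a3", "abstract": "   "},  # whitespace only
]
kept = drop_missing_abstracts(articles)
ids = [a["id"] for a in kept]
```

Filtering before the transform step also keeps the article ids aligned with the embeddings, since both lists are then built from the same rows.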

jaklinger commented 4 years ago

After some discussion, we agreed the following:

(@kstathou no need to do anything, let's save it for discussion tomorrow)

jaklinger commented 4 years ago

Merging as is, since