Closed kstathou closed 4 years ago
Hi @kstathou, thanks for that. It all looks good to me, just some minor stylistic points so far. I'll do a test run once you've implemented those fixes and the Travis tests are passing. Thanks!
Thanks for the comments @jaklinger ! I made the changes, let me know if anything else is needed.
Great, thanks for that! I've just put in a couple of changes to get it running on MySQL.
It's running now, but I'll leave the pipeline running to see how long it takes.
@kstathou I get the following error. Do you know how I should fix it?
2020-07-31 17:54:13.063 [1596210356438338/transform/2 (pid 68363)] WARNING:root:You try to use a model that was created with version 0.3.2, however, your version is 0.3.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)]
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)]
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)]
2020-07-31 17:54:13.064 [1596210356438338/transform/2 (pid 68363)] <flow Doc2ClusterFlow step transform> failed:
2020-07-31 17:54:13.071 [1596210356438338/transform/2 (pid 68363)] Internal error
2020-07-31 17:54:13.071 [1596210356438338/transform/2 (pid 68363)] Traceback (most recent call last):
2020-07-31 17:54:14.373 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/cli.py", line 857, in main
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] start(auto_envvar_prefix='METAFLOW', obj=state)
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 764, in __call__
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] return self.main(args, kwargs)
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 717, in main
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] rv = self.invoke(ctx)
2020-07-31 17:54:14.374 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 956, in invoke
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return ctx.invoke(self.callback, ctx.params)
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/core.py", line 555, in invoke
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return callback(args, kwargs)
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/click/decorators.py", line 27, in new_func
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] return f(get_current_context().obj, args, kwargs)
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/cli.py", line 432, in step
2020-07-31 17:54:14.399 [1596210356438338/transform/2 (pid 68363)] max_user_code_retries)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/task.py", line 393, in run_step
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] self._exec_step_function(step_func)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/metaflow/task.py", line 47, in _exec_step_function
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] step_function()
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "doc2cluster.py", line 156, in transform
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] self.embeddings = model.encode(self.abstracts)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 124, in encode
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] sentences_tokenized = [self.tokenize(sen) for sen in sentences]
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 124, in <listcomp>
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] sentences_tokenized = [self.tokenize(sen) for sen in sentences]
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 184, in tokenize
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] return self._first_module().tokenize(text)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/sentence_transformers/models/Transformer.py", line 48, in tokenize
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] return self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(text))
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 282, in tokenize
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] text = re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), text)
2020-07-31 17:54:14.400 [1596210356438338/transform/2 (pid 68363)] File "/Users/jklinger/anaconda3/envs/t2v/lib/python3.7/re.py", line 194, in sub
2020-07-31 17:54:14.401 [1596210356438338/transform/2 (pid 68363)] return _compile(pattern, flags).sub(repl, string, count)
2020-07-31 17:54:14.401 [1596210356438338/transform/2 (pid 68363)] TypeError: expected string or bytes-like object
2020-07-31 17:54:14.401 [1596210356438338/transform/2 (pid 68363)]
2020-07-31 17:54:14.404 [1596210356438338/transform/2 (pid 68363)] Task failed.
2020-07-31 17:54:14.404 Workflow failed.
2020-07-31 17:54:14.404 Terminating 0 active tasks...
2020-07-31 17:54:14.404 Flushing logs...
Step failure:
Step transform (task-id 2) failed.
@jaklinger I cannot access the database (I guess my IP changed :( ). My guess is that there are papers with missing abstracts. I am now filtering them in the `start` step. Can you check if it works now? I also bumped `sentence-transformers` to `0.3.2` to remove the warning at the top of your error message.
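
For reference, filtering out missing abstracts before encoding could look roughly like this. It is a minimal sketch, not the code in `doc2cluster.py`: the `drop_missing_abstracts` helper and the `(id, abstract)` tuple layout are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def drop_missing_abstracts(
    rows: List[Tuple[str, Optional[str]]]
) -> List[Tuple[str, str]]:
    """Keep only rows whose abstract is a non-empty string.

    A None abstract reaching the tokenizer is one way to trigger
    "TypeError: expected string or bytes-like object" inside re.sub.
    """
    return [
        (paper_id, abstract)
        for paper_id, abstract in rows
        if isinstance(abstract, str) and abstract.strip()
    ]


# Hypothetical rows: one valid abstract, one NULL, one whitespace-only.
rows = [("a1", "Deep learning for graphs."), ("a2", None), ("a3", "   ")]
clean = drop_missing_abstracts(rows)
```

Filtering in the first step means every later step, including the encoder call, only ever sees strings.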
After some discussion, we agreed the following:
- `torch.cuda.is_available()` in order to avoid local running of the code
- `metaflow.batch` decorator with gpu-enabled AWS batch

(@kstathou no need to do anything, let's save it for discussion tomorrow)
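
A guard along the lines of the first point might look like the sketch below. The helper names `cuda_available` and `require_gpu` are hypothetical, and the fallback to `False` when `torch` is not installed is an assumption, not something agreed in this thread.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if torch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


def require_gpu(available: bool) -> None:
    """Fail fast instead of starting a slow CPU-only encode locally."""
    if not available:
        raise RuntimeError(
            "No CUDA device found; run this step on a GPU-enabled batch job."
        )
```

Calling `require_gpu(cuda_available())` at the top of the `transform` step would stop a local run before the model is loaded.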
Merging as is, since
Metaflow pipeline that fetches arxiv articles from a SQL database, transforms their abstracts to vectors with Sentence Transformers and clusters them with HDBSCAN. You have the option to either fit a new HDBSCAN model or use an existing one.
It produces three outputs:
- `ArticleVector` SQL table.
- `ArticleCluster` SQL table.

I opted for:
- `transformers` library and you can plug and play with models by changing a string (see full list here). Note that this cannot run locally and probably needs a GPU for the whole arxiv (or batch it and send it to multiple CPUs).
- `n_components`.

You can run the pipeline as follows (details of these parameters are in `doc2cluster.py`):

python doc2cluster.py --no-pylint run --transformer distilbert-base-nli-stsb-mean-tokens --db_config mysqldb.config --min_cluster_size 10 --min_samples 2 --new_clusterer True --clusterer_name hdbscan_model --s3_bucket <MY_BUCKET_NAME>

python doc2cluster.py --no-pylint run --transformer distilbert-base-nli-stsb-mean-tokens --db_config mysqldb.config --new_clusterer False --clusterer_name hdbscan_model --s3_bucket <MY_BUCKET_NAME>
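
The two invocations differ only in whether a clusterer is fitted or reused. The fit-or-load switch can be sketched as follows. This is an illustration under assumptions: `load_or_fit` is a hypothetical helper, the model is pickled to a local path rather than an S3 bucket, and `fit` stands in for whatever fits the HDBSCAN model in the actual flow.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_fit(path: Path, new_clusterer: bool, fit: Callable[[], Any]) -> Any:
    """Fit and persist a model when new_clusterer is True, else load the saved one."""
    if new_clusterer:
        model = fit()
        # Persist the freshly fitted model so later runs can reuse it.
        path.write_bytes(pickle.dumps(model))
        return model
    # Reuse the previously saved model; fit() is never called on this branch.
    return pickle.loads(path.read_bytes())
```

With `new_clusterer=False` the saved model must already exist at `path`, which mirrors running the first command before the second.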
I also added the `.metaflow/` directory to `.gitignore`.