Closed: hashirabdulbasheer closed this issue 3 years ago.
I recently saw a talk comparing multiple languages, with a few slides on how they visualise them. Words from different languages with similar meanings produced similar visualisations. The talk also had a lot of information about Arabic NLP and its challenges.
I've written a first version of a benchmarking guide here.
@hashirabdulbasheer let me know what you think :) If there's anything inaccurate here, I'd love to hear it.
The main thing I'd quickly like feedback on is the interactive scatter plots: does the tooltip print the Arabic text in the correct order? I know matplotlib has had some issues with this in the past, so I just want to make sure it's not happening here before I make any announcements.
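If you want a quick way to check, a tiny sketch like the one below should be enough to hover over and inspect the tooltip (this assumes whatlies with the bpemb extra is installed; the words are just arbitrary examples):

# Small sanity check: hover over the points and verify the Arabic text in the tooltip
# reads correctly. Assumes `pip install whatlies bpemb`; the words are arbitrary examples.
from whatlies.language import BytePairLanguage
from whatlies.transformers import Pca

lang = BytePairLanguage("ar")
words = ["سلام", "كتاب", "مدرسة", "قمر", "شمس", "بحر"]
lang[words].transform(Pca(2)).plot_interactive(annot=False)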
Awesome, that is superb. Thank you so much. Good work.
I haven't checked it out in detail. Will do that next and get back to you with all the questions.
For the Arabic tooltip, did you mean the ones on the clusters, as shown in the image below? Those look perfect.
I am stuck at step 2.
error 1: FileNotFoundError: [Errno 2] No such file or directory: 'test_Arabic_tweets_negative_20190413.tsv'
Then I gave the right path to the TSV file in my directory. Maybe we have to tell users to put the files in the same directory as the notebook?
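This is roughly how I ended up loading the file from its actual location (the folder below is just an example; adjust it to wherever you downloaded the Kaggle files):

# Rough sketch: read the TSVs from an explicit folder so the notebook's working
# directory does not matter. The path below is only an example.
from pathlib import Path
import pandas as pd

data_dir = Path("~/Downloads/arabic_tweets").expanduser()
df_neg = pd.read_csv(data_dir / "test_Arabic_tweets_negative_20190413.tsv", sep="\t")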
After I gave the path, I get this error:
error 2: at df.columns = ["label", "text"] -> ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements
I am at the step executing the below code:
import pandas as pd
from whatlies.transformers import Umap

# Read in the dataframes from Kaggle
df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t"),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t")
], axis=0).sample(frac=1).reset_index(drop=True)
df.columns = ["label", "text"]

# Sample a small list such that the interactive charts render swiftly.
small_text_list = list(set(df[:1000]['text']))

def mk_plot(lang, title=""):
    return (lang[small_text_list]
            .transform(Umap(2))
            .plot_interactive(annot=False)
            .properties(title=title, width=200, height=200))

mk_plot(lang_bp2, "bp_big") | mk_plot(lang_hf, "huggingface")
This worked for me.
# Read in the dataframes from Kaggle
df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t", names=["label", "text"]),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t", names=["label", "text"])
], axis=0).sample(frac=1).reset_index(drop=True)
But I am getting this error intermittently, not every time:
RuntimeError: The size of tensor a (609) must match the size of tensor b (512) at non-singleton dimension 1
RuntimeError Traceback (most recent call last)
<ipython-input-33-b551f7b93936> in <module>
19 .properties(title=title, width=200, height=200))
20
---> 21 mk_plot(lang_bp2, "bp_big") | mk_plot(lang_hf, "huggingface")
<ipython-input-33-b551f7b93936> in mk_plot(lang, title)
14
15 def mk_plot(lang, title=""):
---> 16 return (lang[small_text_list]
17 .transform(Umap(2))
18 .plot_interactive(annot=False)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in __getitem__(self, query)
76 if isinstance(query, str):
77 return self._get_embedding(query)
---> 78 return EmbeddingSet(*[self._get_embedding(q) for q in query])
79
80 def _get_embedding(self, query: str):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in <listcomp>(.0)
76 if isinstance(query, str):
77 return self._get_embedding(query)
---> 78 return EmbeddingSet(*[self._get_embedding(q) for q in query])
79
80 def _get_embedding(self, query: str):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in _get_embedding(self, query)
79
80 def _get_embedding(self, query: str):
---> 81 features = np.array(self.model(query, padding=False)[0])
82 special_tokens_mask = self.model.tokenizer(
83 query, return_special_tokens_mask=True, return_tensors="np"
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
732 A nested list of :obj:`float`: The features computed by the model.
733 """
--> 734 return super().__call__(*args, **kwargs).tolist()
735
736
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
634 def __call__(self, *args, **kwargs):
635 inputs = self._parse_and_tokenize(*args, **kwargs)
--> 636 return self._forward(inputs)
637
638 def _forward(self, inputs, return_tensors=False):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
655 with torch.no_grad():
656 inputs = self.ensure_tensor_on_device(**inputs)
--> 657 predictions = self.model(**inputs)[0].cpu()
658
659 if return_tensors:
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
836
837 embedding_output = self.embeddings(
--> 838 input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
839 )
840 encoder_outputs = self.encoder(
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
199 token_type_embeddings = self.token_type_embeddings(token_type_ids)
200
--> 201 embeddings = inputs_embeds + position_embeddings + token_type_embeddings
202 embeddings = self.LayerNorm(embeddings)
203 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (609) must match the size of tensor b (512) at non-singleton dimension 1
I just saw that, in your notebook, you had used this code. Maybe we should add it to the documentation?
# Read in the dataframes from Kaggle
df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t", names=["label", "text"]),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t", names=["label", "text"])
], axis=0).loc[lambda d: d['text'].str.len() < 200].sample(frac=1).reset_index(drop=True).drop_duplicates()
In the benchmarking part, it gets stuck at 75%. Any ideas?
It's likely not stuck; it's switching to the huggingface model, which ... takes a lot longer.
I think all of your errors were caused by the missing dataframe code. I've added that, as well as the extra comments you mentioned. I'm pushing to GitHub now; the changes should be live in 2 minutes.
When I scrolled down, I saw the reason why it was at 75%. There was this exception:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-140212ac660b> in <module>
52 train_size=[100, 250, 500, 1000, 2000,
53 3000, 4000, 5000, 6000, 7000]):
---> 54 run_experiment(**setting)
55 print(setting)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/memo/_base.py in wrapper(*args, **kwargs)
73 @wraps(func)
74 def wrapper(*args, **kwargs):
---> 75 result = func(*args, **kwargs)
76 with open(filepath, "a") as f:
77 ser = orjson.dumps(
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/memo/_util.py in wrapper(*args, **kwargs)
42 def wrapper(*args, **kwargs):
43 tic = time.time()
---> 44 result = func(*args, **kwargs)
45 toc = time.time()
46 time_total = toc - tic
<ipython-input-4-140212ac660b> in run_experiment(embedder, train_size, smooth, ngram)
43 # By returning a dictionary `memo` will be able to properly log this.
44 return {"valid_accuracy": float(np.mean(y_test == y_pred)),
---> 45 "train": float(np.mean(y_train == pipe.predict(X_train)))}
46
47 # The grid will loop over all the options and generate a progress bar
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
114
115 # lambda, but not partial, allows help() to work with update_wrapper
--> 116 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
117 # update the docstring of the returned function
118 update_wrapper(out, self.fn)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
417 Xt = X
418 for _, name, transform in self._iter(with_final=False):
--> 419 Xt = transform.transform(Xt)
420 return self.steps[-1][-1].predict(Xt, **predict_params)
421
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/pipeline.py in transform(self, X)
982 Xs = Parallel(n_jobs=self.n_jobs)(
983 delayed(_transform_one)(trans, X, None, weight)
--> 984 for name, trans, weight in self._iter())
985 if not Xs:
986 # All transformers are None
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1049 self._iterating = self._original_iterator is not None
1050
-> 1051 while self.dispatch_one_batch(iterator):
1052 pass
1053
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
864 return False
865 else:
--> 866 self._dispatch(tasks)
867 return True
868
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
782 with self._lock:
783 job_idx = len(self._jobs)
--> 784 job = self._backend.apply_async(batch, callback=cb)
785 # A job can complete so quickly than its callback is
786 # called before we get here, causing self._jobs to
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
705
706 def _transform_one(transformer, X, y, weight, **fit_params):
--> 707 res = transformer.transform(X)
708 # if we have a weight for this transformer, multiply output
709 if weight is None:
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_common.py in transform(self, X)
26 if not np.array(X).dtype.type is np.str_:
27 raise ValueError("You must give this preprocessor text as input.")
---> 28 return np.array([self[x].vector for x in X])
29
30
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_common.py in <listcomp>(.0)
26 if not np.array(X).dtype.type is np.str_:
27 raise ValueError("You must give this preprocessor text as input.")
---> 28 return np.array([self[x].vector for x in X])
29
30
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in __getitem__(self, query)
75 """
76 if isinstance(query, str):
---> 77 return self._get_embedding(query)
78 return EmbeddingSet(*[self._get_embedding(q) for q in query])
79
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in _get_embedding(self, query)
79
80 def _get_embedding(self, query: str):
---> 81 features = np.array(self.model(query, padding=False)[0])
82 special_tokens_mask = self.model.tokenizer(
83 query, return_special_tokens_mask=True, return_tensors="np"
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
732 A nested list of :obj:`float`: The features computed by the model.
733 """
--> 734 return super().__call__(*args, **kwargs).tolist()
735
736
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
634 def __call__(self, *args, **kwargs):
635 inputs = self._parse_and_tokenize(*args, **kwargs)
--> 636 return self._forward(inputs)
637
638 def _forward(self, inputs, return_tensors=False):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
655 with torch.no_grad():
656 inputs = self.ensure_tensor_on_device(**inputs)
--> 657 predictions = self.model(**inputs)[0].cpu()
658
659 if return_tensors:
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
836
837 embedding_output = self.embeddings(
--> 838 input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
839 )
840 encoder_outputs = self.encoder(
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
199 token_type_embeddings = self.token_type_embeddings(token_type_ids)
200
--> 201 embeddings = inputs_embeds + position_embeddings + token_type_embeddings
202 embeddings = self.LayerNorm(embeddings)
203 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (1202) must match the size of tensor b (512) at non-singleton dimension 1
Strange. Did you remove the rows in the data frame that were too long?
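For context: the 512 in that error is the model's maximum sequence length, so any tweet that tokenises to more than 512 tokens will blow up. Something along these lines (the model name is only an example; use whichever model lang_hf wraps) should show whether any long rows slipped through:

# Rough check for tweets that exceed the 512-token limit.
# The model name below is only an example; use the one lang_hf was built from.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
n_too_long = sum(len(tokenizer.encode(t)) > 512 for t in df["text"])
print(f"{n_too_long} tweets tokenise to more than 512 tokens")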
Some code in the document broke formatting, so I am not sure. Here is what I ran. In my case, test_Arabic_tweets_negative_20190413.tsv is in a different folder. Will that matter?
It is now at 76%, but I am not sure whether it's running or not; it's been 20 minutes. If it errors, I am planning to remove the huggingface model and check again.
import pandas as pd
from whatlies.transformers import Umap

df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t", names=["label", "text"]),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t", names=["label", "text"])
], axis=0).loc[lambda d: d['text'].str.len() < 200].sample(frac=1).reset_index(drop=True).drop_duplicates()

small_text_list = list(set(df[:500]['text']))
small_labels = df[:800]['label']
len(small_text_list)
len(small_labels)

# Sample a small list such that the interactive charts render swiftly.
# small_text_list = list(set(df[:1000]['text']))
def mk_plot(lang, title=""):
    return (lang[small_text_list]
            .transform(Umap(2))
            .plot_interactive(annot=False)
            .properties(title=title, width=200, height=200))

mk_plot(lang_bp2, "bp_big") | mk_plot(lang_hf, "huggingface")
It is at 76% and I got one line of output from huggingface, after 20 minutes.
{'embedder': 'hf', 'smooth': 1, 'ngram': True, 'train_size': 100}
It looks like it's running, but slowly. I've got two lines now, but a warning popped up.
{'embedder': 'hf', 'smooth': 1, 'ngram': True, 'train_size': 100}
{'embedder': 'hf', 'smooth': 1, 'ngram': True, 'train_size': 250}
/Users/hashir/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
The warning that you receive is from sklearn. It seems the logistic regression isn't fully converging. You can raise max_iter to remove it, but it also makes sense that it's having trouble converging on such a small dataset. I think the warning goes away once you've got bigger datasets.
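If you do want to bump it, something like this should work (a rough sketch; the embedding step is illustrative and not necessarily the one the guide uses):

# Rough sketch: raising max_iter on the classifier to silence the ConvergenceWarning.
# The BytePairLanguage step here is illustrative; plug in whichever embedder you use.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from whatlies.language import BytePairLanguage

pipe = make_pipeline(
    BytePairLanguage("ar"),             # turns raw text into embedding vectors
    LogisticRegression(max_iter=1000),  # the sklearn default is 100
)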
It's taking too long for the huggingface model. It's been over 30 minutes and I only have three lines.
It does seriously take a while. I was running it on a somewhat big CPU though (6 cores / 12 threads) and it was able to parallelise nicely. What kind of machine do you have?
I am running on a MacBook (2 GHz quad-core Intel Core i5). Maybe that's why.
Could we keep huggingface separate, as an optional step? That way, people could see the graphs quickly on the first try and then, if they are interested, enable huggingface. Otherwise, I am afraid they will think it's not working; initially, I thought it was stuck.
That's fair. Let me make some adjustments to the article.
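In the benchmark loop, something along these lines would make it opt-in (a rough sketch, assuming memo's grid and the run_experiment from the guide; the parameter values below are illustrative):

# Rough sketch: make the Hugging Face embedder opt-in so the first run stays fast.
# Assumes `run_experiment` from the guide; the grids of values below are illustrative.
from memo import grid

run_hf = False  # flip to True once the fast embedders have finished

for setting in grid(embedder=["bp", "hf"],
                    smooth=[0, 1],
                    ngram=[True, False],
                    train_size=[100, 250, 500, 1000]):
    if setting["embedder"] == "hf" and not run_hf:
        continue
    run_experiment(**setting)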
Did you do anything for threading? Maybe it's not dividing the work across threads in my case?
What operating system are you using? I was running my benchmark on a small Linux server. I didn't do any conscious allocation of threads, but it might be that huggingface is clever enough to arrange that for me on my OS.
Actually, I am running on my local laptop, not on any server. I installed Jupyter and then ran the command 'jupyter notebook' to browse and open the notebook.
Can we run it in the cloud, for example on Google Colab?
I haven't tried that yet. Technically there might be a speedup, and you're free to try it out, but I don't see exploring that as a hard requirement for this short guide.
Since there is a guide live on the docs now, I'll consider this issue fixed.
@hashirabdulbasheer thanks for the prompt! 👍
Hi
I checked out Rasa Whatlies on Arabic and English using BytePairLanguage. I used PCA and UMAP to see whether similar messages cluster, but they don't. For both English and Arabic, they don't seem to cluster.
I used tweets from two categories:
For the Arabic version, here are the graphs:
1) PCA: https://drive.google.com/file/d/1Dtdpaigzv6SqhLuL6GJT-HRK5QaZh_0z/view?usp=sharing
2) UMAP: https://drive.google.com/file/d/1WvawUrtNWNg7yzeAMCBmeC18guZGc21M/view?usp=sharing
For English:
1) PCA: https://drive.google.com/file/d/15kes9eKEnLognix_E5w9QZ_ixHjPab56/view?usp=sharing
2) UMAP: https://drive.google.com/file/d/12E5V_Q0B73nkUxdpdmIpYRb-bDcb9Hwd/view?usp=sharing
Any ideas?
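One thing I still plan to try is colouring the 2-D projection by label, to see whether the two categories separate at all. Roughly something like this (assuming df with text/label columns and the BytePairLanguage lang_bp2 are loaded):

# Rough sketch: colour a PCA projection of the embeddings by tweet label.
# Assumes `df` with "text"/"label" columns and a BytePairLanguage `lang_bp2`.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

texts = df[:500]["text"].tolist()
labels = np.array(df[:500]["label"])

X = lang_bp2.transform(texts)              # one embedding vector per tweet
X2 = PCA(n_components=2).fit_transform(X)

for lab in set(labels):
    mask = labels == lab
    plt.scatter(X2[mask, 0], X2[mask, 1], label=lab, s=10, alpha=0.6)
plt.legend()
plt.show()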
thanks hashir