Closed: hashirabdulbasheer closed this issue 3 years ago.
I recently saw a talk comparing multiple languages, with a few slides on how they visualise them. Words from different languages with similar meanings produced similar visualisations. The talk also had a lot of information about Arabic NLP and its challenges.
I've written a first version of a benchmarking guide here.
@hashirabdulbasheer let me know what you think :) If there's anything inaccurate here, I'd love to hear it.
The main thing I'd quickly like feedback on is the interactive scatter plots: does the tooltip print the Arabic text in the correct order? I know matplotlib has had some issues with this in the past, so I just want to make sure it's not happening here before I make any announcements.
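If you want a quick way to check, a tiny sketch like the one below should be enough to hover over and inspect the tooltip (this assumes whatlies with the bpemb extra is installed; the words are just arbitrary examples):

# Small sanity check: hover over the points and verify the Arabic text in the tooltip
# reads correctly. Assumes `pip install whatlies bpemb`; the words are arbitrary examples.
from whatlies.language import BytePairLanguage
from whatlies.transformers import Pca

lang = BytePairLanguage("ar")
words = ["سلام", "كتاب", "مدرسة", "قمر", "شمس", "بحر"]
lang[words].transform(Pca(2)).plot_interactive(annot=False)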
Awesome, that is superb. Thank you so much. Good work.
I haven't checked it out in detail. Will do that next and get back to you with all the questions.
For the Arabic tooltip, did you mean the ones on the clusters, as shown in the image below? Those look perfect.
I am stuck at step 2.
error 1: FileNotFoundError: [Errno 2] No such file or directory: 'test_Arabic_tweets_negative_20190413.tsv'
Then I gave the right path to the TSV file in my directory. Maybe we have to tell users to put the files in the same directory as the notebook?
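This is roughly how I ended up loading the file from its actual location (the folder below is just an example; adjust it to wherever you downloaded the Kaggle files):

# Rough sketch: read the TSVs from an explicit folder so the notebook's working
# directory does not matter. The path below is only an example.
from pathlib import Path
import pandas as pd

data_dir = Path("~/Downloads/arabic_tweets").expanduser()
df_neg = pd.read_csv(data_dir / "test_Arabic_tweets_negative_20190413.tsv", sep="\t")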
After I gave the path, I get this error:
error 2: at df.columns = ["label", "text"] -> ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements
I am at the step executing the below code:
import pandas as pd
from whatlies.transformers import Umap

# Read in the dataframes from Kaggle
df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t"),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t")
], axis=0).sample(frac=1).reset_index(drop=True)
df.columns = ["label", "text"]

# Sample a small list such that the interactive charts render swiftly.
small_text_list = list(set(df[:1000]['text']))

def mk_plot(lang, title=""):
    return (lang[small_text_list]
            .transform(Umap(2))
            .plot_interactive(annot=False)
            .properties(title=title, width=200, height=200))

mk_plot(lang_bp2, "bp_big") | mk_plot(lang_hf, "huggingface")
This worked for me.
# Read in the dataframes from Kaggle
df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t", names=["label", "text"]),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t", names=["label", "text"])
], axis=0).sample(frac=1).reset_index(drop=True)
But I am getting this error intermittently, not every time:
RuntimeError: The size of tensor a (609) must match the size of tensor b (512) at non-singleton dimension 1
RuntimeError Traceback (most recent call last)
<ipython-input-33-b551f7b93936> in <module>
19 .properties(title=title, width=200, height=200))
20
---> 21 mk_plot(lang_bp2, "bp_big") | mk_plot(lang_hf, "huggingface")
<ipython-input-33-b551f7b93936> in mk_plot(lang, title)
14
15 def mk_plot(lang, title=""):
---> 16 return (lang[small_text_list]
17 .transform(Umap(2))
18 .plot_interactive(annot=False)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in __getitem__(self, query)
76 if isinstance(query, str):
77 return self._get_embedding(query)
---> 78 return EmbeddingSet(*[self._get_embedding(q) for q in query])
79
80 def _get_embedding(self, query: str):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in <listcomp>(.0)
76 if isinstance(query, str):
77 return self._get_embedding(query)
---> 78 return EmbeddingSet(*[self._get_embedding(q) for q in query])
79
80 def _get_embedding(self, query: str):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in _get_embedding(self, query)
79
80 def _get_embedding(self, query: str):
---> 81 features = np.array(self.model(query, padding=False)[0])
82 special_tokens_mask = self.model.tokenizer(
83 query, return_special_tokens_mask=True, return_tensors="np"
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
732 A nested list of :obj:`float`: The features computed by the model.
733 """
--> 734 return super().__call__(*args, **kwargs).tolist()
735
736
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
634 def __call__(self, *args, **kwargs):
635 inputs = self._parse_and_tokenize(*args, **kwargs)
--> 636 return self._forward(inputs)
637
638 def _forward(self, inputs, return_tensors=False):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
655 with torch.no_grad():
656 inputs = self.ensure_tensor_on_device(**inputs)
--> 657 predictions = self.model(**inputs)[0].cpu()
658
659 if return_tensors:
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
836
837 embedding_output = self.embeddings(
--> 838 input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
839 )
840 encoder_outputs = self.encoder(
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
199 token_type_embeddings = self.token_type_embeddings(token_type_ids)
200
--> 201 embeddings = inputs_embeds + position_embeddings + token_type_embeddings
202 embeddings = self.LayerNorm(embeddings)
203 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (609) must match the size of tensor b (512) at non-singleton dimension 1
I just saw that, in your notebook, you had used this code. Maybe we should add it to the documentation?
# Read in the dataframes from Kaggle
df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t", names=["label", "text"]),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t", names=["label", "text"])
], axis=0).loc[lambda d: d['text'].str.len() < 200].sample(frac=1).reset_index(drop=True).drop_duplicates()
In the benchmarking part, it gets stuck at 75%. Any ideas?
It's likely not stuck; it's switching to the huggingface model, which ... takes a lot longer.
I think all of your errors were caused by the missing dataframe code. I've added that, as well as the extra comments you mentioned. I'm pushing to GitHub now; the changes should be live in 2 minutes.
When I scrolled down, I saw the reason why it was at 75%. There was this exception:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-140212ac660b> in <module>
52 train_size=[100, 250, 500, 1000, 2000,
53 3000, 4000, 5000, 6000, 7000]):
---> 54 run_experiment(**setting)
55 print(setting)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/memo/_base.py in wrapper(*args, **kwargs)
73 @wraps(func)
74 def wrapper(*args, **kwargs):
---> 75 result = func(*args, **kwargs)
76 with open(filepath, "a") as f:
77 ser = orjson.dumps(
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/memo/_util.py in wrapper(*args, **kwargs)
42 def wrapper(*args, **kwargs):
43 tic = time.time()
---> 44 result = func(*args, **kwargs)
45 toc = time.time()
46 time_total = toc - tic
<ipython-input-4-140212ac660b> in run_experiment(embedder, train_size, smooth, ngram)
43 # By returning a dictionary `memo` will be able to properly log this.
44 return {"valid_accuracy": float(np.mean(y_test == y_pred)),
---> 45 "train": float(np.mean(y_train == pipe.predict(X_train)))}
46
47 # The grid will loop over all the options and generate a progress bar
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
114
115 # lambda, but not partial, allows help() to work with update_wrapper
--> 116 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
117 # update the docstring of the returned function
118 update_wrapper(out, self.fn)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
417 Xt = X
418 for _, name, transform in self._iter(with_final=False):
--> 419 Xt = transform.transform(Xt)
420 return self.steps[-1][-1].predict(Xt, **predict_params)
421
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/pipeline.py in transform(self, X)
982 Xs = Parallel(n_jobs=self.n_jobs)(
983 delayed(_transform_one)(trans, X, None, weight)
--> 984 for name, trans, weight in self._iter())
985 if not Xs:
986 # All transformers are None
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
1049 self._iterating = self._original_iterator is not None
1050
-> 1051 while self.dispatch_one_batch(iterator):
1052 pass
1053
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
864 return False
865 else:
--> 866 self._dispatch(tasks)
867 return True
868
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
782 with self._lock:
783 job_idx = len(self._jobs)
--> 784 job = self._backend.apply_async(batch, callback=cb)
785 # A job can complete so quickly than its callback is
786 # called before we get here, causing self._jobs to
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/pipeline.py in _transform_one(transformer, X, y, weight, **fit_params)
705
706 def _transform_one(transformer, X, y, weight, **fit_params):
--> 707 res = transformer.transform(X)
708 # if we have a weight for this transformer, multiply output
709 if weight is None:
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_common.py in transform(self, X)
26 if not np.array(X).dtype.type is np.str_:
27 raise ValueError("You must give this preprocessor text as input.")
---> 28 return np.array([self[x].vector for x in X])
29
30
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_common.py in <listcomp>(.0)
26 if not np.array(X).dtype.type is np.str_:
27 raise ValueError("You must give this preprocessor text as input.")
---> 28 return np.array([self[x].vector for x in X])
29
30
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in __getitem__(self, query)
75 """
76 if isinstance(query, str):
---> 77 return self._get_embedding(query)
78 return EmbeddingSet(*[self._get_embedding(q) for q in query])
79
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/whatlies/language/_hftransformers_lang.py in _get_embedding(self, query)
79
80 def _get_embedding(self, query: str):
---> 81 features = np.array(self.model(query, padding=False)[0])
82 special_tokens_mask = self.model.tokenizer(
83 query, return_special_tokens_mask=True, return_tensors="np"
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
732 A nested list of :obj:`float`: The features computed by the model.
733 """
--> 734 return super().__call__(*args, **kwargs).tolist()
735
736
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
634 def __call__(self, *args, **kwargs):
635 inputs = self._parse_and_tokenize(*args, **kwargs)
--> 636 return self._forward(inputs)
637
638 def _forward(self, inputs, return_tensors=False):
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
655 with torch.no_grad():
656 inputs = self.ensure_tensor_on_device(**inputs)
--> 657 predictions = self.model(**inputs)[0].cpu()
658
659 if return_tensors:
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
836
837 embedding_output = self.embeddings(
--> 838 input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
839 )
840 encoder_outputs = self.encoder(
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
~/.pyenv/versions/3.7.6/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
199 token_type_embeddings = self.token_type_embeddings(token_type_ids)
200
--> 201 embeddings = inputs_embeds + position_embeddings + token_type_embeddings
202 embeddings = self.LayerNorm(embeddings)
203 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (1202) must match the size of tensor b (512) at non-singleton dimension 1
Strange. Did you remove the rows in the data frame that were too long?
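For context: the 512 in that error is the model's maximum sequence length, so any tweet that tokenises to more than 512 tokens will blow up. Something along these lines (the model name is only an example; use whichever model lang_hf wraps) should show whether any long rows slipped through:

# Rough check for tweets that exceed the 512-token limit.
# The model name below is only an example; use the one lang_hf was built from.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
n_too_long = sum(len(tokenizer.encode(t)) > 512 for t in df["text"])
print(f"{n_too_long} tweets tokenise to more than 512 tokens")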
Some code in the document broke formatting, so I am not sure. Here is what I ran. In my case, test_Arabic_tweets_negative_20190413.tsv is in a different folder. Will that matter?
It is now at 76%, but I am not sure whether it's running or not; it's been 20 minutes. If it errors, I am planning to remove the huggingface model and check again.
import pandas as pd
from whatlies.transformers import Umap

df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t", names=["label", "text"]),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t", names=["label", "text"])
], axis=0).loc[lambda d: d['text'].str.len() < 200].sample(frac=1).reset_index(drop=True).drop_duplicates()

small_text_list = list(set(df[:500]['text']))
small_labels = df[:800]['label']
len(small_text_list)
len(small_labels)

# Sample a small list such that the interactive charts render swiftly.
# small_text_list = list(set(df[:1000]['text']))
def mk_plot(lang, title=""):
    return (lang[small_text_list]
            .transform(Umap(2))
            .plot_interactive(annot=False)
            .properties(title=title, width=200, height=200))

mk_plot(lang_bp2, "bp_big") | mk_plot(lang_hf, "huggingface")
It is at 76% and I got one line of output from huggingface, after 20 minutes.
{'embedder': 'hf', 'smooth': 1, 'ngram': True, 'train_size': 100}
It looks like it's running, but slowly. I've got two lines now, but a warning popped up.
{'embedder': 'hf', 'smooth': 1, 'ngram': True, 'train_size': 100}
{'embedder': 'hf', 'smooth': 1, 'ngram': True, 'train_size': 250}
/Users/hashir/.pyenv/versions/3.7.6/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
The warning that you receive is from sklearn. It seems the logistic regression isn't fully converging. You can raise max_iter to remove it, but it also makes sense that it's having trouble converging on such a small dataset. I think the warning goes away once you've got bigger datasets.
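If you do want to bump it, something like this should work (a rough sketch; the embedding step is illustrative and not necessarily the one the guide uses):

# Rough sketch: raising max_iter on the classifier to silence the ConvergenceWarning.
# The BytePairLanguage step here is illustrative; plug in whichever embedder you use.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from whatlies.language import BytePairLanguage

pipe = make_pipeline(
    BytePairLanguage("ar"),             # turns raw text into embedding vectors
    LogisticRegression(max_iter=1000),  # the sklearn default is 100
)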
It's taking too long for the huggingface model. It's been over 30 minutes and I only have three lines.
It does seriously take a while. I was running it on a somewhat big CPU though (6 cores / 12 threads) and it was able to parallelise nicely. What kind of machine do you have?
I am running on a MacBook (2 GHz quad-core Intel Core i5). Maybe that's why.
Could we keep huggingface separate, as an optional step? That way, people could see the graphs quickly on the first try and then, if they are interested, enable huggingface. Otherwise, I am afraid they will think it's not working; initially, I thought it was stuck.
That's fair. Let me make some adjustments to the article.
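In the benchmark loop, something along these lines would make it opt-in (a rough sketch, assuming memo's grid and the run_experiment from the guide; the parameter values below are illustrative):

# Rough sketch: make the Hugging Face embedder opt-in so the first run stays fast.
# Assumes `run_experiment` from the guide; the grids of values below are illustrative.
from memo import grid

run_hf = False  # flip to True once the fast embedders have finished

for setting in grid(embedder=["bp", "hf"],
                    smooth=[0, 1],
                    ngram=[True, False],
                    train_size=[100, 250, 500, 1000]):
    if setting["embedder"] == "hf" and not run_hf:
        continue
    run_experiment(**setting)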
Did you do anything for threading? Maybe it's not dividing the work across threads in my case?
What operating system are you using? I was running my benchmark on a small Linux server. I didn't do any conscious allocation of threads, but it might be that huggingface is clever enough to arrange that for me on my OS.
Actually, I am running on my local laptop, not on any server. I installed Jupyter and then ran the command 'jupyter notebook' to browse and open the notebook.
Can we run it in the cloud, for example on Google Colab?
I haven't tried that yet. Technically there might be a speedup, and you're free to try it out, but I don't see exploring that as a hard requirement for this short guide.
Since there is a guide live on the docs now, I'll consider this issue fixed.
@hashirabdulbasheer thanks for the prompt! 👍
Hi
I checked out Rasa Whatlies on Arabic and English using BytePairLanguage. I used PCA and UMAP to see whether similar messages cluster, but they don't. For both English and Arabic, they don't seem to cluster.
I used tweets from two categories:
For the Arabic version, here are the graphs:
1) PCA: https://drive.google.com/file/d/1Dtdpaigzv6SqhLuL6GJT-HRK5QaZh_0z/view?usp=sharing
2) UMAP: https://drive.google.com/file/d/1WvawUrtNWNg7yzeAMCBmeC18guZGc21M/view?usp=sharing
For English:
1) PCA: https://drive.google.com/file/d/15kes9eKEnLognix_E5w9QZ_ixHjPab56/view?usp=sharing
2) UMAP: https://drive.google.com/file/d/12E5V_Q0B73nkUxdpdmIpYRb-bDcb9Hwd/view?usp=sharing
Any ideas?
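One thing I still plan to try is colouring the 2-D projection by label, to see whether the two categories separate at all. Roughly something like this (assuming df with text/label columns and the BytePairLanguage lang_bp2 are loaded):

# Rough sketch: colour a PCA projection of the embeddings by tweet label.
# Assumes `df` with "text"/"label" columns and a BytePairLanguage `lang_bp2`.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

texts = df[:500]["text"].tolist()
labels = np.array(df[:500]["label"])

X = lang_bp2.transform(texts)              # one embedding vector per tweet
X2 = PCA(n_components=2).fit_transform(X)

for lab in set(labels):
    mask = labels == lab
    plt.scatter(X2[mask, 0], X2[mask, 1], label=lab, s=10, alpha=0.6)
plt.legend()
plt.show()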
thanks hashir