Flashtext not working in Pyspark

mvuckovic70 commented 3 years ago

I use Databricks on AWS. When I try to run flashtext's replace_keywords on a Spark DataFrame column, to make multiple replacements from a dictionary, it throws an error:

PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30497.0 failed 4 times, most recent failure: Lost task 0.3 in stage 30497.0 (TID 1920456) (10.165.253.154 executor 2166): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 469, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'flashtext''. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 469, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'flashtext'

Here is the code:

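import flashtext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

processor = flashtext.KeywordProcessor()
for k, v in generic_acronyms.items():
    processor.add_keyword(k, v)

# replace_keywords returns a string, so the UDF is declared with StringType rather than ArrayType(StringType())
replace_udf = udf(lambda x: processor.replace_keywords(x), StringType())
oxydata_titles_select = oxydata_titles_select.withColumn("title1", replace_udf(oxydata_titles_select["title0"]))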

Dobatymo commented 3 years ago

It seems flashtext is simply not installed on the worker nodes. Make sure you didn't install it only on the master.
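
For anyone reading the traceback above: the failure happens inside pickle.loads, which suggests the object was serialized where the module exists and deserialized where it does not. The sketch below reproduces that mechanism with the standard library only. The module name demo_keywords_pkg and its Processor class are hypothetical stand-ins for flashtext, and the "driver"/"worker" split is simulated in a single process, so treat it as an illustration of the error, not of PySpark's internals.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Hypothetical stand-in for a third-party package such as flashtext.
MODULE_NAME = "demo_keywords_pkg"
MODULE_SOURCE = (
    "class Processor:\n"
    "    def __init__(self):\n"
    "        self.keywords = {}\n"
)


def pickle_then_load_without_module():
    """Pickle an object while its module is importable, then unpickle after it is gone."""
    with tempfile.TemporaryDirectory() as tmp:
        # "Install" the module by writing it to a directory on sys.path.
        with open(os.path.join(tmp, MODULE_NAME + ".py"), "w") as fh:
            fh.write(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module(MODULE_NAME)
            # Simulated driver side: the class is pickled by reference
            # (module name + class name), not by value.
            payload = pickle.dumps(module.Processor())
        finally:
            # "Uninstall" the module again.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
    importlib.invalidate_caches()

    # Simulated worker side: unpickling has to import the module first.
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(pickle_then_load_without_module())
```

The returned message has the same shape as the one in the report, with the stand-in module name in place of flashtext.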

mvuckovic70 commented 3 years ago

I solved it by installing the library directly on the cluster, rather than by using pip install inside the code. It works now. Thanks.
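
One way to confirm a fix like this is to ask each Python process whether it can see the module. The helper below is a minimal sketch using only the standard library; check_module is a hypothetical name, not part of flashtext or PySpark.

```python
import importlib.util
import socket


def check_module(name):
    """Report whether the top-level module `name` is importable in this process."""
    return {
        "host": socket.gethostname(),
        "module": name,
        "available": importlib.util.find_spec(name) is not None,
    }


if __name__ == "__main__":
    # Runs in the local process only; see the note below for executors.
    print(check_module("flashtext"))
```

Run locally, this only checks the driver. Assuming an active SparkSession named spark, something like spark.sparkContext.parallelize(range(8), 8).map(lambda _: check_module("flashtext")).collect() should run the same check inside executor processes, although it is not guaranteed to reach every worker.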

vi3k6i5 / flashtext

Flashtext not working in Pyspark #126