Updated.
My approach to labeling 100M+ rows has been:
This is quite slow on a reasonably sized cluster. I'm thinking a Pandas UDF might be a better approach here, or alternatively to create the L (rule-based label) matrix on the entire dataset and then predict in chunks. I'll try that next.
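Not from the thread, but for concreteness, a minimal sketch of what a Pandas-UDF-style approach could look like using Spark 3's mapInPandas. The names lfs (a list of Snorkel labeling functions), label_model (an already-fit LabelModel) and df_train are assumptions, as is the assumption that the closure over them serializes cleanly with Spark's pickler:

from pyspark.sql import types as T
from snorkel.labeling import PandasLFApplier

def label_chunks(pdf_iter):
    # Build the applier on the executor; each pdf is a pandas DataFrame chunk.
    applier = PandasLFApplier(lfs)
    for pdf in pdf_iter:
        L = applier.apply(pdf, progress_bar=False)   # (n_rows, n_lfs) label matrix
        pdf["label_proba"] = label_model.predict_proba(L).tolist()
        yield pdf

schema = T.StructType(
    df_train.schema.fields
    + [T.StructField("label_proba", T.ArrayType(T.DoubleType()))]
)
df_labeled = df_train.mapInPandas(label_chunks, schema=schema)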
Update:
Tried
I second @S-C-H: returning a NumPy array is strange and could pose issues when working with Spark. My DataFrame is a sample of ~2.5M rows. I currently feed that L matrix as an np.array directly to the fit method. This works at this scale, but I assume it's only a matter of scale or of the cardinality of the LFs before this setup breaks.
I'll follow for any updates on this issue.
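For reference, a minimal sketch of the setup described above; lfs, df_sample and the cardinality are assumed names/values, not taken from the thread:

from snorkel.labeling.apply.spark import SparkLFApplier
from snorkel.labeling.model import LabelModel

applier = SparkLFApplier(lfs)
L_train = applier.apply(df_sample.rdd)   # returns an np.ndarray collected on the driver

label_model = LabelModel(cardinality=2, verbose=True)  # cardinality assumed
label_model.fit(L_train, n_epochs=500)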
Update: Success, kind of!
After a lot of trial and error, I arrived at a working solution.
tl;dr: I used mapPartitions to predict locally on each partition, using the PandasLFApplier interface.
What I tried that didn't work (partial list):
When using mapPartitions, getting lfs (the list of labeling functions) to play well was the main challenge. I tried:
- broadcasting the lfs list and reading the broadcast value either in each of the executors or in the mapPartitions function itself (sketched right after this list);
- broadcasting each individual lf and reading the broadcast value in the same manner;
- passing lfs as an argument to the mapPartitions function;
- using pickle / dill to save the lfs list, or each lf separately, and loading the objects either in each of the executors or in the mapPartitions function itself.
All of the above methods ran into either closure or pickling issues. It seems pickle couldn't handle the lf decorator syntax, and replacing pickle with dill had pyspark-specific issues (dill tried and failed to write the Spark context to file).
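For concreteness, a sketch (not the exact code) of the first failed variant, broadcasting the lfs list and reading it inside mapPartitions; the names mirror the working code below, and pickling the decorated LFs for the broadcast is the step that failed:

import pandas as pd
from snorkel.labeling import PandasLFApplier

lfs_bc = sc.broadcast([lf1, lf2])   # pickling the decorated LFs here is what breaks

def apply_lfs(rows):
    # Rebuild a pandas DataFrame for the partition and apply the broadcast LFs.
    applier = PandasLFApplier(lfs_bc.value)
    pdf = pd.DataFrame([row.asDict() for row in rows])
    yield applier.apply(pdf, progress_bar=False)

l_parts = df_train.rdd.mapPartitions(apply_lfs).collect()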
Here's the version that did work. The idea is to define the lfs object inside a closure so that it is available on the executors, build a pd.DataFrame in each executor, and proceed with the PandasLFApplier API.
Note: I had issues converting between Pandas and PySpark types, since int is nullable in PySpark but not in native Python / Pandas. Here I replace null values with -1 (which I know is not present in my data), and later in the classifier pipeline the imputer treats the -1 values as missing (see the sketch after the code below).
import numpy as np
import pandas as pd
from pyspark.sql import types as T
from snorkel.labeling import PandasLFApplier

# Broadcast the fitted LabelModel so each executor can read it.
_model_bc = sc.broadcast(label_model)

def get_proba_labels(df, spark):
    model = _model_bc.value
    lfs = [lf1, lf2, ...]   # defined here so the LFs ship with the closure
    dtypes_list = df.dtypes

    def predict_proba_iterator(rows, dtypes_list):
        # Rebuild a pandas DataFrame from the partition's rows.
        pd_df = pd.DataFrame([row.asDict() for row in rows])
        for col, dtype in dtypes_list:
            if dtype == 'int':
                # int can be nullable in Spark but not in native Python.
                pd_df[col] = pd_df[col].fillna(-1).astype(int)
        pd_df = predict_proba_pd_df(pd_df)
        for _, pd_row in pd_df.iterrows():
            yield T.Row(**pd_row.to_dict())

    def predict_proba_pd_df(df):
        # Apply the LFs locally and let the broadcast LabelModel predict.
        applier = PandasLFApplier(lfs)
        l_matrix = applier.apply(df)
        df['label_proba'] = model.predict_proba(l_matrix).tolist()
        df['label_argmax'] = np.vectorize(np.argmax)(df['label_proba'])
        return df

    output_schema = T.StructType(
        df.schema.fields + [
            T.StructField('label_proba', T.ArrayType(T.FloatType())),
            T.StructField('label_argmax', T.IntegerType())
        ]
    )

    rdd = df.rdd.mapPartitions(lambda x: predict_proba_iterator(x, dtypes_list))
    res_df = spark.createDataFrame(
        data=rdd,
        schema=output_schema
    ).select(df.columns + [
        'label_proba',
        'label_argmax'
    ])
    return res_df

df_train_labeled = get_proba_labels(
    df_train,
    spark=spark
)
df_train_labeled.cache()
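Regarding the note above about using -1 as a missing-value placeholder: downstream this can be handled by, for example, scikit-learn's SimpleImputer (illustrative only, not part of the code above):

from sklearn.impute import SimpleImputer

# Treat the -1 placeholder written by the labeling step as a missing value.
imputer = SimpleImputer(missing_values=-1, strategy="median")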
I find it strange that the SparkLFApplier and other associated functions return NumPy arrays instead of Spark RDDs or Columns. I'm unsure how a NumPy array is then supposed to be recombined with the RDD or DataFrame, given that these are distributed objects.
The nicest solution would be to allow the user to write something like df.withColumn(SparkLFApplier(*cols)).
https://github.com/snorkel-team/snorkel/blob/master/snorkel/labeling/apply/spark.py
Edit: I've successfully re-combined this (I think). But I'm finding scalability quite challenging; this is where I feel a better connection between Snorkel and Spark would be helpful.
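Not from the thread, but one way the NumPy L matrix returned by SparkLFApplier could be re-joined to the source DataFrame is to index both sides and join; this sketch assumes lfs and df exist and that row order is preserved between df.rdd and the applier's output:

from pyspark.sql import Row
from snorkel.labeling.apply.spark import SparkLFApplier

L = SparkLFApplier(lfs).apply(df.rdd)   # np.ndarray collected on the driver

# Pair each row of L with its index and turn it into a small DataFrame.
l_df = spark.sparkContext.parallelize(
    [(i, row.tolist()) for i, row in enumerate(L)]
).toDF(["_row_id", "lf_votes"])

# Index the original DataFrame the same way, then join on the index.
df_indexed = df.rdd.zipWithIndex().map(
    lambda pair: Row(_row_id=pair[1], **pair[0].asDict())
).toDF()
df_with_votes = df_indexed.join(l_df, on="_row_id").drop("_row_id")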