Oh, one thing I thought about... rather than use -1 as abstain and integers as classes in a dense matrix, it would be far more efficient to use a sparse matrix and impute abstain if possible. At least in my application.
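To make that concrete, here's a rough sketch of what I mean (this isn't the current Snorkel API; it assumes shifting labels by +1 so the implicit zero of a sparse matrix can stand in for abstain):

```python
import numpy as np
from scipy import sparse

# Dense Snorkel-style label matrix: -1 = abstain, 0..k-1 = class labels.
L_dense = np.array([
    [-1,  1, -1],
    [ 0, -1, -1],
    [-1, -1,  2],
])

# Shift every label by +1 so abstain (-1) maps to 0, the implicit value in
# a sparse matrix; classes become 1..k and only real votes are stored.
L_sparse = sparse.csr_matrix(L_dense + 1)

# Round-trip back to the dense convention when a consumer needs it.
assert (L_sparse.toarray() - 1 == L_dense).all()
```

With thousands of LFs most entries in each row are abstains, so the stored data shrinks to roughly the number of actual votes.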
Hi @rjurney, this is great, thanks for reporting!! Will mark as Q&A for other folks looking for answers. This is - I think - a somewhat unique case since you have thousands of LFs and are running single-node Spark (see follow-up on #1500), but we discussed the design decisions and future plans around sparse matrices a bit here: https://github.com/snorkel-team/snorkel/pull/1309#issuecomment-545203275
Issue description
While this isn't the normal use pattern, I wanted to see if you want me to add a note to the SparkLFApplier documentation about increasing (or setting to 0) the value of spark.driver.maxResultSize if you use a lot of LFs on a lot of data. Otherwise you get an error indicating that the label numpy.array returned by SparkLFApplier when it calls collect() has exceeded the default spark.driver.maxResultSize of 1GB. This is because I have 1.7 million records and 1,800 LFs: the final numpy array is 1.7M x 1,800, which at 8 bytes per int64 label totals about 24GB. This is a Spark configuration issue, but for multi-task problems it is one people may run into. It dies at that collect() call; the full script is linked under the repro steps below.
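For reference, here is a minimal sketch of the equivalent programmatic fix, assuming you build the SparkSession yourself rather than passing --conf to spark-submit (the app name is a hypothetical placeholder; the config key is the real Spark setting):

```python
from pyspark.sql import SparkSession

# spark.driver.maxResultSize=0 removes the cap on the total size of
# serialized results that collect() may bring back to the driver; a value
# like "24g" would set an explicit limit instead. It must be set before
# the SparkSession (and its SparkContext) is created.
spark = (
    SparkSession.builder
    .appName("label_spark")  # hypothetical app name
    .config("spark.driver.maxResultSize", "0")
    .getOrCreate()
)
```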
I set it to 0 via spark-submit --conf spark.driver.maxResultSize=0 and it runs ok. The call to numpy.save() results in a 24GB file, so not everyone will have this happen, but for multi-task problems it seems like it might be common enough.
Code example/repro steps
The script I ran is this: https://github.com/rjurney/weakly_supervised_learning_code/blob/0491c0a42ac7a2af9bf72287c5d7e9ec76ebff9c/ch05/label.spark.py
run via spark-submit ch05/label.spark.py. At its core it follows the standard SparkLFApplier pattern, sketched below.
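Here is a toy version of that pattern, with a single hypothetical LF and a two-row DataFrame standing in for the ~1,800 real LFs and 1.7 million records:

```python
import numpy as np
from pyspark.sql import SparkSession
from snorkel.labeling import labeling_function
from snorkel.labeling.apply.spark import SparkLFApplier

ABSTAIN, POSITIVE = -1, 0

# Hypothetical stand-in for the ~1,800 real labeling functions.
@labeling_function()
def lf_mentions_spark(x):
    return POSITIVE if "spark" in x.text.lower() else ABSTAIN

spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "0")
    .getOrCreate()
)
df = spark.createDataFrame([("uses spark",), ("plain text",)], ["text"])

# apply() evaluates every LF on every row, then collect()s the results to
# the driver; that collect() is what exceeds the 1GB default at
# 1.7M rows x 1,800 LFs.
applier = SparkLFApplier([lf_mentions_spark])
L = applier.apply(df.rdd)       # np.ndarray of shape (n_rows, n_lfs)
np.save("label_matrix.npy", L)  # ~24GB at full scale
```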
After a while it fails with the spark.driver.maxResultSize error described above.
Expected behavior
I expect it to return a large numpy array :)
System info
Snorkel installed from source via pip install -e .
pyspark==2.4.4