dgoldenberg-audiomack opened this issue 3 years ago
We don't offer specific guidance because everything that applies to scaling TensorFlow models applies here equally.
`StringLookup` can accept either a numpy array or a path to a vocabulary file; either way, millions of ids will be fine. I'm afraid I know very little about Spark, but the basic approach of generating training data using Spark, and post-processing results using Spark again, makes sense to me. I imagine you'd also be able to load the trained model into memory in your Spark functions and call it, so (3) could in principle be done entirely in Spark.
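A minimal sketch of the two vocabulary forms (the ids and the file path below are placeholders):

```python
import numpy as np
import tensorflow as tf

# Option 1: an in-memory vocabulary (numpy array of unique ids).
user_ids = np.array([b"user_1", b"user_2", b"user_3"])
lookup_from_array = tf.keras.layers.StringLookup(vocabulary=user_ids, mask_token=None)
print(lookup_from_array(tf.constant([b"user_2"])).numpy())  # e.g. [2]

# Option 2: a vocabulary file with one id per line, which avoids
# materializing millions of ids in Python first (path is a placeholder).
lookup_from_file = tf.keras.layers.StringLookup(
    vocabulary="/path/to/user_id_vocab.txt", mask_token=None)
```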
Thanks, @maciejkula. This makes sense to me; I'll update the ticket as I experiment with larger datasets.
This may be a naive intuition, but "wouldn't it be nice" if TensorFlow's datasets could "ride" Spark datasets directly. It's probably wishful thinking, but if there were a way to integrate Spark datasets directly into TF (as in, some TF dataset class being a subclass of a Spark dataset), the framework would avoid having to deal with potentially massive amounts of intermediary files, both on the input and on the output.
In other words, a TensorFlow dataset could become enormously powerful if it WERE a Spark dataset, with all the rich functionality that offers, including joins, etc.
I've filed https://github.com/tensorflow/tensorflow/issues/46236 for this.
Yes, @MLnick, Spark-TF may seem like the place to look; however, what it offers is essentially a conversion between TFRecords and Spark DataFrames.
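Roughly, as I understand the connector (the format name and options here are from my reading of its README, and `spark` is an active SparkSession, so treat the details as assumptions):

```python
# Spark side: dump a DataFrame to TFRecord files
# (assumes the spark-tensorflow-connector jar is on the classpath).
df = spark.createDataFrame(
    [("user_1", "movie_42", 1.0)], ["user_id", "movie_id", "label"])
df.write.format("tfrecords").option("recordType", "Example").save("/tmp/train_tfrecords")

# TensorFlow side: read the intermediary files back via tf.data.
import tensorflow as tf

feature_spec = {
    "user_id": tf.io.FixedLenFeature([], tf.string),
    "movie_id": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.float32),
}
files = tf.data.Dataset.list_files("/tmp/train_tfrecords/part-*")
dataset = tf.data.TFRecordDataset(files).map(
    lambda record: tf.io.parse_single_example(record, feature_spec))
```

So the intermediary files never go away; they are the interchange format.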
Interop is the wrong word to use here; what I'm really after is the ability to do all dataprep, all TF training, and all data post-processing after the training, all in one cluster and all natively in Spark (or Flink, or Storm, or whatever one's clustering 'workhorse' happens to be).
The Spark-TF approach seems to imply that all 3 of these processes (dataprep, train, post-process) would be in separate clusters, or steps, and that is both cumbersome and expensive.
There is also TensorFlowOnSpark, which purports to enable the following: "By combining salient features from the TensorFlow deep learning framework with Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers." However, this is also obscure and hard to follow and integrate, IMO.
What I'm looking for is an easy way to do all 3 processes in one cluster and that would mean that TF would need to be natively Spark, i.e. a Tensor would be a Spark dataframe.
Additionally, I want to be able to distribute training easily on an AWS EMR cluster, and currently that's very cumbersome in TF. You have to specify a distribution strategy (and those are obscure), then jump through hoops in EMR to have the cluster match that strategy, all the while getting long, cryptic stack traces from TF when it doesn't like one thing or another.
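For concreteness, this is the kind of boilerplate I'm referring to; a rough sketch with placeholder host/port values, not a working EMR setup:

```python
import json
import os
import tensorflow as tf

# Each worker needs a TF_CONFIG describing the whole cluster *before* the
# strategy is constructed; the host:port values here are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1.example.com:12345", "node2.example.com:12345"]},
    "task": {"type": "worker", "index": 0},  # index differs on each node
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Model building/compiling has to happen inside the strategy scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
# The same script then has to be launched on every node with a matching TF_CONFIG.
```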
What seems to be a lot more appropriate is BigDL, "a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters."
However, for example, I wanted to use TF-recommenders and that "rides" TF, not BigDL. So then one has to think of porting TF-recommenders to BigDL.... An arduous experience.
As a side note, TF-recommenders looks very attractive as it appears to support content-based filtering (CBF) rather than just collaborative filtering (CF). The latter is what most recommender frameworks provide, and it works OK, but eventually one is likely to want to integrate complex features, and CF doesn't allow that since it only works with user IDs, item IDs, and events (interactions).
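To illustrate what I mean by integrating complex features (the feature names below are made up), the user tower can mix an id embedding with arbitrary content features rather than relying on ids alone; a rough sketch:

```python
import tensorflow as tf

# Hypothetical user tower combining an id embedding with a numeric content
# feature; feature names and sizes are invented for illustration only.
class UserModel(tf.keras.Model):
    def __init__(self, unique_user_ids):
        super().__init__()
        self.id_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
        ])
        # Call .adapt() on real data before training.
        self.age_norm = tf.keras.layers.Normalization(axis=None)

    def call(self, features):
        return tf.concat([
            self.id_embedding(features["user_id"]),
            tf.reshape(self.age_norm(features["age"]), (-1, 1)),
        ], axis=1)
```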
Indeed it's a problem that still has not been solved very well.
I think TF-on-Spark is arguably one of the better options (though it still leaves a lot to be desired), and as you mention, BigDL is a decent option too (but it doesn't run on GPUs!).
Are you set on Spark? I think the PyData ecosystem projects, especially Dask, will have easier integration with TF, PyTorch, etc.
I'll look at Dask. However, as with anything, many folks these days are on AWS EMR or some equivalent technology in their services. What most folks would want, I'd venture to say, is something that runs seamlessly on one of those.
Using Dask immediately poses the question of how to plop it onto EMR or the equivalent.
Spark-TF-distributor is another one of those seemingly obvious things to look at but I had a problem making heads or tails of it and never got a clear explanation of what it does and how. E.g. see my tickets: https://github.com/tensorflow/ecosystem/issues/184 https://github.com/tensorflow/ecosystem/issues/178 https://github.com/tensorflow/ecosystem/issues/177
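For what it's worth, the usage I think it expects looks roughly like this (based on my reading of its README; the parameter values are placeholders and I may well be misreading it):

```python
from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
    # Plain Keras training code; as I understand it, the runner wraps this
    # in a MultiWorkerMirroredStrategy across the Spark task slots it grabs.
    import tensorflow as tf
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(...) on a tf.data pipeline every worker can read, e.g. from S3.

# num_slots / use_gpu values are arbitrary placeholders.
MirroredStrategyRunner(num_slots=2, use_gpu=False).run(train)
```

What it does under the hood, and how it interacts with EMR's YARN setup, is exactly the part I couldn't get clarity on.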
IMO -- considering that a "regular developer" wants to just focus on solving the business problem at hand, all this stitching together of iffy frameworks is far from being a seamless experience. Once somewhat stitched, you run into a number of hefty stack traces and then you try to get those resolved, which may take weeks or months.
@dgoldenberg-audiomack I agree, the experience is far from seamless currently.
The closest looking thing is actually: https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/Overview/orca.html (from the BigDL team at Intel I think).
I plan to give it a go on YARN to see how it works!
That does look interesting, @MLnick. What sort of use-case are you playing with?
I’d like to try scaling out a basic retrieval model using TF recommenders, to see how it can scale up to larger data size.
I think the user/item id vocabularies would be somewhat limited in size because they still need to fit into single-node memory, but a few million should still be possible.
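For a rough sense of the memory involved (the counts and embedding dimension below are just assumptions):

```python
# Approximate embedding-table footprint for the two retrieval towers.
num_users, num_items, dim = 5_000_000, 2_000_000, 64
bytes_per_param = 4  # float32
print((num_users + num_items) * dim * bytes_per_param / 1e9, "GB")  # ~1.8 GB
```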
Nice. I think it would be great to have a full example of how to scale TFRS so it's easy to take and plop into AWS EMR or the like with few tweaks.
The documentation does not appear to offer much guidance on how to scale up a TFRS based solution, for example, here: https://www.tensorflow.org/recommenders/examples/basic_retrieval.
To start, let's consider a (common, I'm sure) scenario of N million users and M million items (such as movies). We would also be envisioning millions, if not billions, of feature records, but for the moment we can set that aside and contemplate just the 'basic retrieval' use case/example.
There are at least 3 considerations when designing an actual recommender, in terms of scale: preparing the training data, training the model, and post-processing the trained model's output into recommendations.
I can uniquefy my data in Spark when prepping Parquet for TFRS. However, it appears that building a model in TFRS is oriented toward using numpy-array vocabularies (see the sketch below).
How are vocabularies expected to perform with millions of IDs? How will this affect memory, and if this is not expected to scale, what alternatives are there to the vocabulary/numpy-array approach?
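For reference, the pattern I'm referring to from the basic_retrieval example looks roughly like this (`ratings` is the MovieLens tf.data.Dataset from that tutorial):

```python
import numpy as np
import tensorflow as tf

# Collect every unique id into driver/notebook memory as a numpy array,
# as the basic_retrieval tutorial does.
unique_user_ids = np.unique(np.concatenate(list(
    ratings.batch(1_000_000).map(lambda x: x["user_id"]))))

user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
])
```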
All in all, what this feels like is potentially a need for a tighter integration with Spark datasets, where it is easy and seamless to load TF datasets from Spark and then just as easy to convert the resulting recommendations (e.g. a `tfrs.layers.factorized_top_k.BruteForce` index) to Spark datasets. So far, I don't seem to see an easy way to do this (?). Some considerations:
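One path I can imagine today is a driver-side sketch like the following, not a supported integration (it assumes `model` and `movies` are as in the basic_retrieval example and `spark` is an active SparkSession): build the index, score a batch of ids, and hand the rows back to Spark.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

# Build the brute-force index from the trained towers.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    movies.batch(4096).map(lambda title: (title, model.movie_model(title))))

# Score a batch of users and turn the top-k results into Spark rows;
# a real job would more likely do this inside mapPartitions on executors.
user_ids = tf.constant(["user_1", "user_2"])  # placeholder ids
scores, titles = index(user_ids, k=10)

rows = [
    (uid.decode("utf-8"), [t.decode("utf-8") for t in recs])
    for uid, recs in zip(user_ids.numpy(), titles.numpy())
]
recs_df = spark.createDataFrame(rows, ["user_id", "recommendations"])
```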