simon376 / ds-recommender-project

Goodreads-based book recommendation system, created for the "Data Science" class

Use Dask or tf.data to improve input data pipeline #3

Closed · simon376 closed this issue 2 years ago

simon376 commented 2 years ago

Currently, only about 50k datapoints are used for training, since loading the JSON into pandas DataFrames quickly exhausts the available RAM (even on Google Colab). So try writing a generator that batch-loads the data, or something similar.
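A rough sketch of the generator idea, assuming the dump is in JSON-lines format (the file name, column names, and chunk size here are placeholders, not the project's actual schema):

```python
import pandas as pd
import tensorflow as tf

INTERACTIONS_PATH = "goodreads_interactions.json"  # placeholder path

def batch_generator(path, chunksize=10_000):
    # pd.read_json with chunksize returns a JsonReader that yields one
    # DataFrame per chunk, so only one chunk is in memory at a time
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        yield (
            chunk["user_id"].astype(str).to_numpy(),
            chunk["book_id"].astype(str).to_numpy(),
            chunk["rating"].to_numpy(),
        )

dataset = tf.data.Dataset.from_generator(
    lambda: batch_generator(INTERACTIONS_PATH),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.string),
        tf.TensorSpec(shape=(None,), dtype=tf.string),
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
    ),
)
```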

Because the raw input data has to be preprocessed in pandas anyway, using Dask DataFrames instead of pandas may be the key to success: tf.data doesn't come with a native read_json helper, and preprocessing the data in pandas is easiest (and already implemented).
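A minimal Dask sketch of that idea (file patterns and column names are made up for illustration):

```python
import dask.dataframe as dd

# dd.read_json splits line-delimited JSON into partitions of ~blocksize
# bytes, so nothing is loaded into RAM until a computation is triggered
books = dd.read_json("goodreads_books*.json", lines=True, blocksize="64MB")
reviews = dd.read_json("goodreads_reviews*.json", lines=True, blocksize="64MB")

# the familiar pandas API, evaluated lazily partition by partition
merged = reviews.merge(books[["book_id", "title"]], on="book_id")
print(merged.head())  # .head() only computes the partitions it needs
```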

simon376 commented 2 years ago

Dask doesn't play well with tf

simon376 commented 2 years ago

One idea was to do a chunked read of the JSON data, save it as CSV, and then use the tf.data utility functions to create CSV datasets. Alternatively, use the JsonReader that pandas returns as a generator, yielding chunks.
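Roughly what the chunked JSON-to-CSV route could look like (file names, chunk size, and the label column are assumptions):

```python
import pandas as pd
import tensorflow as tf

# chunked JSON -> CSV conversion: write the header once, then append
for i, chunk in enumerate(
    pd.read_json("goodreads_reviews.json", lines=True, chunksize=50_000)
):
    chunk.to_csv(
        "goodreads_reviews.csv",
        mode="w" if i == 0 else "a",
        header=(i == 0),
        index=False,
    )

# tf.data can then stream the CSV in batches instead of loading it whole
dataset = tf.data.experimental.make_csv_dataset(
    "goodreads_reviews.csv",
    batch_size=256,
    label_name="rating",  # assumed label column
    num_epochs=1,
)
```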

But that doesn't work either, because I need to join the DataFrames from the different sources on some keys, and that's only possible if all the keys are known ._.

simon376 commented 2 years ago

Might still try writing the separate JSON files to CSV for now; that may at least improve reading performance during DataFrame creation.

simon376 commented 2 years ago

One more alternative might be to subclass tf.data.Dataset with a custom implementation, but that wouldn't really help performance in any way.

simon376 commented 2 years ago

> Might still try writing the separate JSON files to CSV for now; that may at least improve reading performance during DataFrame creation.

Reading CSV GREATLY improves performance: I can now read 700k datapoints into pandas easily 😎
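For reference, a sketch of the resulting load step (file names are placeholders):

```python
import pandas as pd

# with the one-time JSON -> CSV conversion done, loading is just read_csv,
# which is much faster than parsing the original JSON
books = pd.read_csv("goodreads_books.csv")
reviews = pd.read_csv("goodreads_reviews.csv")
print(len(reviews))  # ~700k rows fit comfortably in memory now
```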