Dask doesn't play well with tf
One idea was to do a chunked read on the JSON data, save it as a CSV, and then use the tf.data utility functions to create CSV datasets. Alternatively, use the JsonReader that pandas returns as a generator, yielding chunks.
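Roughly what I mean by the chunked read / generator idea (the file name, `chunksize` and the `lines=True` layout are just assumptions here):

```python
import pandas as pd

# read the JSON line-by-line in chunks instead of loading everything at once;
# "data.json" and chunksize=10_000 are placeholders
reader = pd.read_json("data.json", lines=True, chunksize=10_000)

# the returned JsonReader is itself iterable, so it already behaves like a
# generator yielding plain pandas DataFrames
for chunk in reader:
    print(len(chunk), list(chunk.columns))
```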
But that doesn't work either, because I need to join the DataFrames from different sources on some keys, and that's only possible if all the keys are known ._.
Might still try to write the separate JSON files to CSV for now; that may at least improve reading performance on DataFrame creation?
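The one-off conversion would look something like this (the `data/` folder layout and `lines=True` are assumptions):

```python
import pandas as pd
from pathlib import Path

# one-off conversion: turn each separate JSON file into a CSV so that later
# runs can use the much faster pd.read_csv
for json_path in Path("data").glob("*.json"):
    df = pd.read_json(json_path, lines=True)
    df.to_csv(json_path.with_suffix(".csv"), index=False)
```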
One more alternative may be to subclass tf.data.Dataset and write my own implementation, but that wouldn't really help performance in any way.
Reading CSV GREATLY improves performance; I can now easily read 700k datapoints into pandas 😎
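For the tf.data side of the original plan, something like `make_csv_dataset` should then be able to stream the converted files directly; the file pattern, batch size and the "label" column name are placeholders:

```python
import tensorflow as tf

# stream the converted CSV files straight into a tf.data pipeline
dataset = tf.data.experimental.make_csv_dataset(
    "data/*.csv",        # file pattern (placeholder)
    batch_size=256,      # placeholder
    label_name="label",  # placeholder column name
    num_epochs=1,
    shuffle=True,
)

for features, labels in dataset.take(1):
    print({name: tensor.shape for name, tensor in features.items()}, labels.shape)
```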
Currently, only about 50k datapoints are used for training, since loading JSON into a pandas DataFrame quickly exhausts the available RAM (even on Google Colab), so I'll try to write a generator to batch-load the data or something.
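A possible shape for that generator, feeding tf.data via `from_generator` so the whole DataFrame never has to be in RAM at once (file name, chunk size and column names are placeholders):

```python
import numpy as np
import pandas as pd
import tensorflow as tf

def batches(path="train.csv", chunksize=10_000):
    # read the CSV chunk by chunk and yield (features, labels) arrays,
    # so only one chunk is ever held in memory
    for chunk in pd.read_csv(path, chunksize=chunksize):
        features = chunk.drop(columns=["label"]).to_numpy(dtype=np.float32)
        labels = chunk["label"].to_numpy(dtype=np.int32)
        yield features, labels

dataset = tf.data.Dataset.from_generator(
    batches,
    output_signature=(
        tf.TensorSpec(shape=(None, None), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
    ),
)

# the chunks can then be re-batched / shuffled like any other tf.data.Dataset
dataset = dataset.unbatch().shuffle(10_000).batch(256)
```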
Because the raw input data has to be preprocessed in pandas, using Dask DataFrames instead of pandas may be the key to success, since tf.data doesn't come with a read_json helper natively and preprocessing the data in pandas is easiest (and already implemented).
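Rough sketch of how the Dask variant could look; the paths, the join key and the preprocessing step are just placeholders:

```python
import dask.dataframe as dd

# lazily load and preprocess the separate JSON sources with the pandas-like
# Dask API, then materialize the result as CSV for tf.data to consume
left = dd.read_json("data/source_a_*.json", lines=True)
right = dd.read_json("data/source_b_*.json", lines=True)

joined = left.merge(right, on="key")   # same merge API as pandas
joined = joined[joined["value"] > 0]   # stand-in for the real preprocessing

# writes one CSV per partition ("*" is replaced by the partition number)
joined.to_csv("data/preprocessed-*.csv", index=False)
```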