murphy214 opened 6 years ago
@murphy214 nice speedup! I like that it still uses the dataframe's to_json() method for JSON serialization where appropriate, together with a list comprehension. How much memory does this approach consume versus the baseline method, which operates on one row of the dataframe at a time in memory?
I'd have to profile it, as I'm not sure exactly what's going on in the underlying dataframe methods, but I think it's a safe assumption that the memory cost of this implementation is at least the size of the geojson file itself, since it allocates the entire geojson string for all the features in memory.
That being said, if you run into issues with the geojson's size in memory, you're almost certainly going to have much bigger issues with the underlying dataframe it's derived from. (i.e., if we're having memory issues, the data shouldn't be in a dataframe at all; it needs to be moved into an out-of-memory structure.)
pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset - Wes McKinney
I guess to conclude: you will see a memory spike, but it should be a lot smaller than the underlying dataframe it represents (I'd think), and for most use cases it won't matter (IMO).
I agree with you @murphy214 re: using an out-of-memory structure for larger dataframes.
Let's perform some quick tests to see what the real-world memory impact is, i.e. does this effectively double/triple/etc. the memory needs of a dataframe? My main concern here is that a dataframe could comfortably fit into memory, but may require several multiples of that memory footprint to hold an additional in-memory dictionary in geojson format.
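A minimal sketch of one way to run that test, assuming `mapboxgl.utils.df_to_geojson` is the function under test (the import path and keyword arguments here are assumptions); `tracemalloc` from the standard library reports the peak allocation during the conversion:

```python
import tracemalloc

import numpy as np
import pandas as pd

from mapboxgl.utils import df_to_geojson  # assumed import path for the function under test

# Hypothetical test frame; substitute a realistic dataset for a fair test.
n = 100_000
df = pd.DataFrame({
    "lon": np.random.uniform(-180, 180, n),
    "lat": np.random.uniform(-90, 90, n),
    "value": np.random.rand(n),
})

base = df.memory_usage(deep=True).sum()

tracemalloc.start()
geojson = df_to_geojson(df, lat="lat", lon="lon", properties=["value"])
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"dataframe size: {base / 1e6:.1f} MB")
print(f"peak extra allocation during conversion: {peak / 1e6:.1f} MB")
print(f"ratio: {peak / base:.2f}x")
```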
One approach here could be to slice the dataframe into chunks of 1k rows at a time, run through a loop converting each chunk in memory, and write each chunk to file (see the sketch below). That would also open up the opportunity to multithread the DF -> geojson operation at one chunk per thread, if the I/O speed of writing each feature to disk is the bottleneck.
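A rough sketch of that chunked idea (the function name and signature are mine, not the library's): only one chunk's features are ever materialized before being streamed to disk.

```python
import json

import pandas as pd

def _default(obj):
    # numpy scalars (e.g. int64) aren't JSON-serializable by default;
    # .item() converts them to native Python numbers.
    return obj.item()

def df_to_geojson_chunked(df, path, lat="lat", lon="lon", properties=(), chunk_size=1000):
    """Stream a point FeatureCollection to disk one chunk at a time."""
    with open(path, "w") as f:
        f.write('{"type": "FeatureCollection", "features": [')
        first = True
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]
            features = [
                {
                    "type": "Feature",
                    "geometry": {"type": "Point",
                                 "coordinates": [row[lon], row[lat]]},
                    "properties": {p: row[p] for p in properties},
                }
                for row in chunk.to_dict("records")
            ]
            body = ",".join(json.dumps(feat, default=_default) for feat in features)
            f.write(("" if first else ",") + body)
            first = False
        f.write("]}")
```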
If you pull together the numbers for ☝️ @murphy214, please open a PR. Would love to see a speedup of this magnitude!
It might also be helpful to take a look at the geopandas package. I'm using GeoDataFrames quite often for quick inspection of results.
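For example (a quick sketch; geopandas is a separate dependency, and `points_from_xy` requires geopandas >= 0.5):

```python
import geopandas as gpd
import pandas as pd

df = pd.DataFrame({"lon": [-77.0, -76.9], "lat": [38.9, 38.8], "value": [1, 2]})

# Promote the frame to a GeoDataFrame with a point geometry column...
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat), crs="EPSG:4326")

# ...then serialization to geojson is a one-liner.
geojson_str = gdf.to_json()
```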
Hey, this should probably be a pull request, but some of the point simplification (from the geojson package, I'd guess) made it too much of a pain to write tests. Anyway, here's a file that implements df_to_geojson in effectively the exact same way, but much faster (I think).
Nothing crazy is being done here: it just uses pandas methods to create the geometry strings and then to_json() off the dataframe for our properties, wrapped in a list comprehension. Anyway, I figure it could be useful at least to look at.
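A minimal sketch of that pattern (not the attached file itself, which isn't reproduced here): geometry strings are built with vectorized pandas string concatenation, the properties come from a single to_json() call, and a list comprehension stitches the two together.

```python
import json

import pandas as pd

def df_to_geojson_fast(df, lat="lat", lon="lon", properties=(), precision=6):
    # Build every geometry string at once with vectorized string
    # concatenation instead of looping over rows.
    geometry = (
        '{"type":"Point","coordinates":['
        + df[lon].round(precision).astype(str)
        + ","
        + df[lat].round(precision).astype(str)
        + "]}"
    )

    # One to_json() call serializes the properties for every row; json.loads
    # splits the resulting JSON array back into one dict per row.
    props = json.loads(df[list(properties)].to_json(orient="records"))

    # Stitch geometry and properties together in a single list comprehension.
    features = [
        {"type": "Feature", "geometry": json.loads(geom), "properties": prop}
        for geom, prop in zip(geometry, props)
    ]
    return {"type": "FeatureCollection", "features": features}
```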
Output