mapbox / mapboxgl-jupyter

Use Mapbox GL JS to visualize data in a Python Jupyter notebook
MIT License

Significantly faster df_to_geojson function for file output (8x) #96

Open murphy214 opened 6 years ago

murphy214 commented 6 years ago

Hey, this should probably be a pull request, but some of the point simplification (coming from the geojson package, I'd guess) made it too much of a pain to write tests. Anyway, here's a file that implements df_to_geojson in effectively the exact same way, but much faster (I think).

Nothing crazy is being done here: it just uses pandas methods to build the geometry strings, plus to_json on the dataframe for the properties, wrapped in a list comprehension. Anyway, I figure it could be useful at least to look at.

import mapboxgl.utils as old
import time
import pandas as pd
import random
import json
import geojson  # needed for geojson.FeatureCollection in the no-filename branch

################################################################################
# THESE FUNCTIONS EXIST ONLY TO CREATE A POINT DATA SET

CHARS = 'abcdefghijklmnopqurstuvwxyz0123456789'
SIZECHARS = len(CHARS)

def random_int():
    return random.randint(0,10000)

def random_char():
    return CHARS[random.randint(0,SIZECHARS-1)]

def random_string():
    i = 0
    string = ""
    while i < 10:
        string+= random_char()
        i+= 1
    return string

def random_float():
    return random.uniform(0.0,10000.0)

# generates a number of random values
def generate_rands(number,func):
    newlist = []
    i = 0
    while i < number:
        newlist.append(func())
        i+=1
    return newlist

# generates a random [lon, lat] point
def random_point():
    return [random.uniform(-180.0,180.0),random.uniform(-90.0,90.0)]

################################################################################

# builds the GeoJSON geometry string for a single row; indexes positionally
# so it works for any lon/lat column names, not just LONG/LAT
def geometry(x):
    return '{"type": "Point","coordinates": [%s,%s]}' % (x.iloc[0], x.iloc[1])

def df_to_geojson(df, properties=None, lat='lat', lon='lon', precision=6, filename=None):
    """Serialize a Pandas dataframe to geojson: write a file if `filename`
    is given, otherwise return a geojson.FeatureCollection.
    """

    if not properties:
        # if no properties are selected, use all properties in dataframe
        properties = [c for c in df.columns if c not in [lon, lat]]

    for prop in properties:
        # Check if list of properties exists in dataframe columns
        if prop not in list(df.columns):
            raise ValueError(
                'properties must be a valid list of column names from dataframe')
        if prop in [lon, lat]:
            raise ValueError(
                'properties cannot be the geometry longitude or latitude column')

    if filename:
        with open(filename, 'w') as f:
            # write the full FeatureCollection to file as a single string
            f.write('{"type": "FeatureCollection", "features": [' +
                ','.join(['{"geometry": %s, "type": "Feature", "properties": %s}' % (geom, props)
                    for geom, props in zip(
                        df[[lon, lat]].apply(geometry, axis=1).values.tolist(),
                        df[properties].to_json(orient='records', lines=True).splitlines()
                    )]) +
                ']}')

            return {
                "type": "file",
                "filename": filename,
                "feature_count": df.shape[0]
            }
    else:
        features = []
        df[[lon, lat] + properties].apply(lambda x: features.append(
            old.row_to_geojson(x, lon, lat, precision)), axis=1)
        return geojson.FeatureCollection(features)

# generating the dataframe to benchmark with
# 10k points 1 string,float, and int field respectively
number_of_rows = 10000

data = pd.DataFrame(generate_rands(number_of_rows,random_point),columns=['LONG','LAT'])
data['COL1'] = generate_rands(len(data),random_int)
data['COL2'] = generate_rands(len(data),random_string)
data['COL3'] = generate_rands(len(data),random_float)
print(number_of_rows, 'data created')

s = time.time()
for i in range(5):
    old.df_to_geojson(data,properties=data.columns[2:].values.tolist(),lon='LONG',lat='LAT',filename='a.geojson')
e = time.time() - s
opspeed1 = (e / 5.)
print('secs / op: %s' % opspeed1)

s = time.time()
for i in range(5):
    df_to_geojson(data,properties=data.columns[2:].values.tolist(),lon='LONG',lat='LAT',filename='b.geojson')
e = time.time() - s
opspeed2 = (e / 5.)
print('secs / op: %s' % opspeed2)

print "%sx faster" % (opspeed1 / opspeed2)

'''
compare values here
with open('a.geojson') as f:
    data = f.read()
    oldfile = json.loads(data.replace('\n', ''))

with open('b.geojson') as f:
    data = f.read()
    newfile = json.loads(data)

for oldfeat, newfeat in zip(oldfile['features'], newfile['features']):
    # compare the old and the new here
    pass
'''

Output

10000 data created
secs / op: 4.89236383438
secs / op: 0.54987282753
8.89726422082x faster
ryanbaumann commented 6 years ago

@murphy214 nice speedup! I like that it still uses the dataframe's to_json() method for JSON serialization where appropriate, combined with a list comprehension. How much memory does this approach consume versus the baseline method, which operates on one row of the dataframe at a time in memory?

murphy214 commented 6 years ago

I'd have to profile it, as I'm not sure exactly what is going on in the underlying dataframe methods, but I think it's a pretty safe assumption that the memory footprint of this implementation is at least equal to the size of the geojson file itself, since it allocates the entire geojson string for all the features in memory.

That being said, if you run into issues with the geojson file size in memory, you're most certainly going to have much bigger issues with the underlying dataframe it's derived from. (I.e., if we're having memory issues, the data shouldn't be represented in a dataframe at all; it needs to be taken into an out-of-memory structure.)

pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset - Wes McKinney

To conclude, you will see a memory spike, but it should be a lot smaller than the underlying dataframe it represents (I'd think), and for most use cases it won't matter (IMO).

ryanbaumann commented 6 years ago

I agree with you @murphy214 re: using an out-of-memory structure for larger dataframes.

Let's perform some quick tests to see what the real-world memory impact is, i.e. does this effectively double/triple/etc. the memory needs of a dataframe. My main concern here is that a dataframe could comfortably fit into memory, but may require several multiples of that memory footprint to hold an additional in-memory copy of the data in geojson format.
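For a first pass, a hedged sketch using the memory_profiler package (an extra dependency, pip install memory-profiler, not part of this repo) could compare peak usage of the two implementations from the benchmark script above:

# Sketch: measure peak memory of each implementation with memory_profiler.
# Assumes `data`, `old.df_to_geojson`, and the new df_to_geojson are the
# ones defined in the benchmark script above.
from memory_profiler import memory_usage

props = data.columns[2:].values.tolist()

baseline_peak = max(memory_usage(
    (old.df_to_geojson, (data,),
     dict(properties=props, lon='LONG', lat='LAT', filename='a.geojson'))))

new_peak = max(memory_usage(
    (df_to_geojson, (data,),
     dict(properties=props, lon='LONG', lat='LAT', filename='b.geojson'))))

print('baseline peak: %.1f MiB, new peak: %.1f MiB' % (baseline_peak, new_peak))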

One approach here could be to slice the dataframe into chunks of 1k rows at a time, loop over each chunk in memory, and write it out to file. That would also open up the opportunity to multithread the DF -> geojson operation, one chunk per thread, if the I/O speed of writing each feature to disk is the bottleneck. A rough sketch of the chunked idea is below.
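An untested sketch of that chunking idea; the `df_to_geojson_chunked` name and `chunk_size` default are illustrative, not an existing API in this repo:

import pandas as pd

# Stream the FeatureCollection to disk one slice of the dataframe at a
# time, so only chunk_size rows' worth of geojson strings are ever held
# in memory at once.
def df_to_geojson_chunked(df, properties, lon='lon', lat='lat',
                          filename='out.geojson', chunk_size=1000):
    with open(filename, 'w') as f:
        f.write('{"type": "FeatureCollection", "features": [')
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]
            # geometry strings for this chunk only
            geoms = ('{"type": "Point", "coordinates": [%s, %s]}' % (x, y)
                     for x, y in zip(chunk[lon], chunk[lat]))
            # one JSON object per line, serialized by pandas
            props = chunk[properties].to_json(
                orient='records', lines=True).splitlines()
            body = ','.join(
                '{"geometry": %s, "type": "Feature", "properties": %s}'
                % (g, p) for g, p in zip(geoms, props))
            f.write(body if start == 0 else ',' + body)
        f.write(']}')
    return {"type": "file", "filename": filename, "feature_count": len(df)}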

ryanbaumann commented 6 years ago

If you pull together the numbers for ☝️ @murphy214, please open a PR. Would love to see a speedup of this magnitude!

rutgerhofste commented 6 years ago

It might also be helpful to take a look at the geopandas package. I'm using GeoDataFrames quite often for quick inspection of results.
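For comparison, a minimal geopandas sketch of the same export (assuming the benchmark dataframe `data` from above, and that geopandas and its fiona dependency are installed):

import geopandas as gpd

# Build a GeoDataFrame from the LONG/LAT columns, then let geopandas
# handle the GeoJSON serialization.
gdf = gpd.GeoDataFrame(
    data.drop(columns=['LONG', 'LAT']),
    geometry=gpd.points_from_xy(data['LONG'], data['LAT']),
    crs='EPSG:4326')
gdf.to_file('c.geojson', driver='GeoJSON')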