hershaw / benchmarks


performance #1

Open hershaw opened 8 years ago

hershaw commented 8 years ago

The following was run on a MacBook with 2 cores and 8 GB of memory with all applications closed.

(benchmarks)benchmarks master > ./run-benchmarks.sh 
sys:1: DtypeWarning: Columns (0,19) have mixed types. Specify dtype option on import or set low_memory=False.
pandas read csv: 11.1477160454s
pandas apply transforms: 0.938997983932s
2016-04-15 13:10:16,467 [INFO] sframe.cython.cy_server, 172: SFrame v1.8.5 started. Logging /tmp/sframe_server_1460722216.log
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,float,float,float,float,str,str,float,str,str,str,str,str,float,str,str,str,str,str,str,str,str,str,str,float,float,str,float,float,float,float,float,float,str,float,str,float,float,float,float,float,float,float,float,float,str,float,str,str,float,str,float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Unable to parse line "Loans that do not meet the credit policy,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"
Read 89872 lines. Lines per second: 35505.4
Read 426460 lines. Lines per second: 56880.2
1 lines failed to parse correctly
Finished parsing file /Users/samuelhopkins/cp/benchmarks/data/lc_big.csv
Parsing completed. Parsed 756878 lines in 12.2473 secs.
sframe read csv: 17.1927471161s
sframe apply transforms: 16.9669880867s
node apply transforms: 466.212ms

As you can see, applying the operations in pandas takes about 1s, in sframe about 18s, and in node.js about 0.5s.

The performance difference between pandas and sframe is probably due to the fact that with pandas I can use the native isin and map functions, which I am guessing are highly optimized, while with sframe I am simply using apply with pure Python functions.
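
For reference, here is a minimal sketch of the two styles side by side. The 'grade' column and the payload values are made up; the real transforms live in the repo.

import pandas as pd
import sframe

drop_payload = ['F', 'G']            # hypothetical values to drop
combine_payload = ['C', 'D']         # hypothetical values to merge
combined = '_'.join(combine_payload)

# pandas: native isin/map primitives
df = pd.DataFrame({'grade': ['A', 'C', 'F', 'D', 'B']})
df = df[~df['grade'].isin(drop_payload)]
df['grade'] = df['grade'].map({v: combined for v in combine_payload}).fillna(df['grade'])

# sframe: the same operations expressed as apply over pure Python lambdas
sf = sframe.SFrame({'grade': ['A', 'C', 'F', 'D', 'B']})
sf = sf[sf['grade'].apply(lambda x: x not in drop_payload)]
sf['grade'] = sf['grade'].apply(lambda x: combined if x in combine_payload else x)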

However, I can confirm that sframe is using all cores, which leads me to believe that if I can perform the individual operations more efficiently, I should see better results.

Node.js is the winner so far, but its scalability and predictability are a bit limited, so we are willing to take a hit of a few milliseconds if we can get something robust that uses all cores.

I guess the questions here are: 1) Did I miss something in the documentation for sframe that provides functionality equivalent to pandas map and isin, and 2) if not, how can I optimize the given operations?

ylow commented 8 years ago

1: The SFrame-wide access in

sf[sf.apply(create_drops_filter(drops))]

is slow.

Lambda accesses should be as narrow as possible. Here is a simpler implementation, I think:

def apply_transforms(sf, transforms):
    drops, combines = split_drops_combines(transforms)
    for drop in drops:
        # narrow lambda: apply only to the affected column, not the whole row
        sf = sf[sf[drop['name']].apply(lambda x: x not in drop['payload'])]
    for c in combines:
        newval = '_'.join(map(str, c['payload']))
        # rewrite matching values to the combined value; everything else passes through
        sf[c['name']] = sf[c['name']].apply(lambda x: newval if x in c['payload'] else x)
    return sf
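
For illustration, a hypothetical transforms list in the shape the snippet above expects (the 'kind' key and the values here are made up; split_drops_combines in your repo defines the real structure):

transforms = [
    # drop every row whose 'grade' value is in the payload
    {'kind': 'drop', 'name': 'grade', 'payload': ['F', 'G']},
    # collapse these 'grade' values into the single value 'A_B'
    {'kind': 'combine', 'name': 'grade', 'payload': ['A', 'B']},
]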

The lack of a native "isin" should be fixed. There is SArray.contains but it only works for substring matches. Fixing it and making it more general will speed up the "drop" stage.

A native generalization of the "combines" operation seems plausible; I will think about it a bit.

2: Lambdas work by spinning up subprocesses. "Preloading" the subprocesses will shave off a little bit.

import sframe

if __name__ == '__main__':
    # preactivate the lambda subprocesses with a trivial apply before the timed work
    sframe.SArray([1, 2, 3]).apply(lambda x: x + 1)
    main()

3: You probably want a sf.materialize() before ending the timer. I am putting in a missed optimization that will make the whole drop+combine process lazy.
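
Roughly something like this, assuming the apply_transforms above (your actual timing harness may look different):

import time

start = time.time()
sf = apply_transforms(sf, transforms)
sf.materialize()  # force the lazy drop+combine pipeline to execute before stopping the clock
print('sframe apply transforms: {0}s'.format(time.time() - start))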

ylow commented 8 years ago

This should bring the sframe apply transforms down by 2-3x. It is still not as fast as pandas, though. This kind of perf benchmark is hard to tune for, particularly since SFrame is optimized for memory first.

For instance, if you load lc_big in sframe and look at Python's memory usage, I get about 103MB, while in pandas I get about 533MB, i.e. you will be able to fit a lot more data in SFrame. (We will also automatically start going out to disk/SSD once we consume more than half your memory.)

hershaw commented 8 years ago

Thanks! I've incorporated the suggestions, which gave a very good speedup, and am now testing on machines with resources closer to what production would be. Will let you know how those run soon.

2 quick questions:

hershaw commented 8 years ago

Okay, did some preliminary testing with more cores and the results are promising. I'm going to add in some more use cases and organize things a bit more, then re-run on 8, 16, and 32 core machines and see where we get with things.

hershaw commented 8 years ago

@ylow I've added a more complete use case and got some results on the 8-, 16-, and 32-core Google Compute Engine machines that we use in production, and I'm still having a tough time optimizing for our use case.

The full stats are in the results section of the repo. I think not all of the individual timings for sframe are correct, because I haven't found the right way to materialize.

The upshot is that sframe is still quite a bit slower than pandas for our use case, but I think that's because I'm relying heavily on apply. What's the best way to speed this up? Should I use the GraphLab Create SDK to add the methods I need? I don't know C++, so that sounds like a long haul. Maybe there's a Cython solution that would be a bit easier?

I really want to be able to utilize all cores available and I think sframe is my best chance...

hershaw commented 8 years ago

Or maybe we just need to tell SFrame to go ahead and use all the memory it wants?

hershaw commented 8 years ago

Fixed the stats calculations, and performance on the GCE machines is now really good. I'm adding a few more benchmarks, but I think I'm getting the hang of it now and it's starting to look pretty good.

I removed calls to len() that were (sometimes) forcing materialization, and it improved the overall run. I'm thinking that we might be able to just embrace the laziness and build our app around it.

Will be posting new results soon.

ylow commented 8 years ago

Hi, sorry I did not respond over the weekend. I noticed some more optimizations that you can put in: we also have a builtin sarray.astype and an sarray.str_to_datetime (which takes the same format syntax as strptime), both of which should be faster.
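
Roughly something like this; the column names and the format string are just placeholders for whatever your transforms actually do:

# instead of wrapping float() in a lambda with try/except
sf['int_rate'] = sf['int_rate'].astype(float)

# instead of calling datetime.strptime per row in a lambda
sf['issue_d'] = sf['issue_d'].str_to_datetime('%b-%Y')  # hypothetical format string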

ylow commented 8 years ago

We have some missed optimizations around sequences of apply calls as well. I think I can look into implementing those. (as in optimizations in SFrame)

hershaw commented 8 years ago

No worries at all!

In any case, we couldn't use str_to_datetime because some of the entries may be malformed or null and it always throws in this case. We need those to be filled in with null instead.

Same thing with astype(float). In our datasets, casting to a float or int has an unfortunately broad definition, hence all of the try/catches. Probably not much we could do in that case without using the SDK.

ylow commented 8 years ago

I am thinking about what kind of primitives I can add to the SArray to minimize the need for a lambda / SDK function.

1: astype could take a failure value, i.e. astype(float, value_on_failure=xxx). While this may not work exactly for your case, it could be helpful. Ditto for str_to_datetime.

2: The "contains" function should work for lists, arrays and dictionaries.

3: A ternary operator.

condition.iff(true_value/true_vector, false_value/false_vector)

For each element, returns the equivalent of condition ? true_value : false_value

hershaw commented 8 years ago

I'm thinking that the behavior for astype() with float or int is fine. Our use case is really specific and I think that having strict requirements and reliable behavior with exception throwing is actually okay. If something can't be parsed into a number, that's just that (especially since None is handled gracefully). If anything, maybe a null_on_failure option is a good idea, but at least in our datasets, that would result in all-null features.

However, for str_to_datetime we might benefit a lot from having a null_on_parse_failure option, because we have lots of cases where portions of a series contain empty strings or null-ish values. Or maybe it's better to have a repwith() that takes a dictionary and replaces all matches of its keys with their corresponding values. That way we could just replace all empty strings or bad values with None, then call str_to_datetime, and everyone is happy.
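
Until something like repwith() exists, a rough sketch of how we could express that today with a plain apply (the mapping, column name, and format string are hypothetical):

# map known-bad entries to None, pass everything else through untouched
bad_values = {'': None, 'n/a': None, 'null': None}
sf['issue_d'] = sf['issue_d'].apply(lambda x: bad_values.get(x, x))
sf['issue_d'] = sf['issue_d'].str_to_datetime('%b-%Y')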

A native contains() sounds awesome :)

How would the condition operator work? It would still have to execute python code, wouldn't it?

ylow commented 8 years ago

Also, maybe a reverse contains that allows you to do a per-element test of whether each element is in a list, like sarray.is_in(payload) instead of sarray.apply(lambda x: x in payload).

The idea of the condition operator is

# hypothetical numeric length
newval = 0
sarray.apply(lambda x: newval if x > 10 else x)
# Alternative
(sarray > 10).iff(newval, sarray)

Syntax is a little quirky. Not sure how to make it nicer.

hershaw commented 8 years ago

Yes, the is_in() would be great. We would use it for a few things in the near future, e.g. cleaning outliers.

Maybe the ternary API could be sarray = gl.tern(sarray > 10, trueval, [falseval]), where falseval defaults to the original value?

ylow commented 8 years ago

Yeah. That looks like numpy.where. Maybe I will take their syntax.

hershaw commented 8 years ago

Ah, I see. It looks like numpy.where is much more general and requires the last few arguments to be array-like. Since SArray creation is a bit expensive, it might be good to offer a way to avoid it. You could end up with:

gl.where(cond[, ontrue[, onfalse]]), where ontrue and onfalse can be an SArray or a primitive value of the appropriate dtype. ontrue would default to None and onfalse would default to the original value of the SArray entry.

Still needs a bit of thought, but we might be onto something here.

ylow commented 8 years ago

Array-like is fine (rather, SArray-like in this case). Then we can write stuff like:

gl.where(sa <= 10, sa, 10) # truncate all values at 10

hershaw commented 8 years ago

Array-like or primitive for both the 2nd and 3rd args?

ylow commented 8 years ago

Yeah. I am thinking about how to implement this efficiently.