neka-nat / python-forexconnect

Python binding of forexconnect api

Historical Prices #15

Closed JamesKBowler closed 6 years ago

JamesKBowler commented 6 years ago

Hi @neka-nat, when calling the FXCM API for historical data, the object is returned as a list of Python dictionaries. Do you think performance would improve if the object were returned as a NumPy array instead?

Cheers

James

neka-nat commented 6 years ago

Hi, James. In the C++ source, a std::vector of Price structs is copied into a Python list. I do not think there would be much change in terms of speed even if I changed it to copy into a NumPy array instead.

Regards, neka-nat

JamesKBowler commented 6 years ago

Hi @neka-nat

I think this requires a little more discussion, mainly because I am a newbie to C++ and don't really have a clue where to start : )

If I were to play around with the C++ code, is the snippet below the only section I would need to change, or do other parts of this API reference the Prices struct?

struct prices_pickle_suite : boost::python::pickle_suite
{
    static boost::python::tuple getinitargs(const Prices& p)
    {
        return boost::python::make_tuple(p.mDate, p.mOpen, p.mHigh, p.mLow, p.mClose);
    }

    static boost::python::tuple getstate(boost::python::object obj)
    {
        Prices const& p = boost::python::extract<Prices const&>(obj)();
        boost::python::dict d = boost::python::extract<boost::python::dict>(obj.attr("__dict__"));
        d["date"] = p.mDate;
        d["open"] = p.mOpen;
        d["high"] = p.mHigh;
        d["low"] = p.mLow;
        d["close"] = p.mClose;
        return boost::python::make_tuple(d);
    }

    static void setstate(boost::python::object obj, boost::python::tuple state)
    {
        using namespace boost::python;
        Prices& p = extract<Prices&>(obj)();

        if (len(state) != 1)
        {
            PyErr_SetObject(PyExc_ValueError,
                ("expected 1-item tuple in call to __setstate__; got %s"
                 % state).ptr()
            );
            throw_error_already_set();
        }
        dict d = extract<dict>(state[0]);
        p.mDate = extract<boost::posix_time::ptime>(d["date"]);
        p.mOpen = extract<double>(d["open"]);
        p.mHigh = extract<double>(d["high"]);
        p.mLow = extract<double>(d["low"]);
        p.mClose = extract<double>(d["close"]);
    }

    static bool getstate_manages_dict() { return true; }
};

The reason I ask is that once the data is copied into a Python list, I need to convert it to a structured NumPy array.

If the data were copied straight into a structured NumPy array, there would be a performance gain on the Python side of things.
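To sketch what "copying straight into a structured array" could look like on the Python side (the `Prices` stand-in below is hypothetical, not the real `forexconnect.Prices` class), one can preallocate the structured array once and fill it field by field, which is roughly what a C++-side fill would do instead of building an intermediate list:

```python
import numpy as np
from datetime import datetime

# Hypothetical stand-in for forexconnect.Prices (attribute names assumed
# to mirror the C++ struct members shown above)
class Prices:
    def __init__(self, date, open_, high, low, close):
        self.mDate, self.mOpen = date, open_
        self.mHigh, self.mLow, self.mClose = high, low, close

values = [Prices(datetime(2017, 9, 21, 12, i), 1.28, 1.29, 1.27, 1.285)
          for i in range(60)]

dt = np.dtype([('date', '<M8[us]'), ('open', '<f8'), ('high', '<f8'),
               ('low', '<f8'), ('close', '<f8')])

# Preallocate once, then fill each row as a tuple -- no intermediate list
arr = np.empty(len(values), dtype=dt)
for i, v in enumerate(values):
    arr[i] = (v.mDate, v.mOpen, v.mHigh, v.mLow, v.mClose)
```

In Python the loop still pays per-element overhead, but done in C++ the fill would be a single pass writing directly into the array's buffer.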

For example, let's say we access a particular date. Take the following code.

values = fxc.get_historical_prices(
    'GBP/USD',
    datetime(2017, 9, 21, 12, 00),
    datetime(2017, 9, 21, 13, 00),
    'm1'
)
In [33]: type(values)
Out[33]: list

In [34]: type(values[0])
Out[34]: forexconnect.Prices

Numpy Array

dt = np.dtype(
    [('date', '<M8[us]'), ('askopen', '<f8'),
     ('askhigh', '<f8'), ('asklow', '<f8'),
     ('askclose', '<f8'), ('bidopen', '<f8'),
     ('bidhigh', '<f8'), ('bidlow', '<f8'),
     ('bidclose', '<f8'), ('volume', '<i8')]
)
arr = np.array([v.__getinitargs__() for v in values], dtype=dt)
%timeit -n 1000 np.array([v.__getinitargs__() for v in values], dtype=dt)
1000 loops, best of 3: 378 µs per loop
%timeit -n 100000 arr['date'][0].item()
100000 loops, best of 3: 1.19 µs per loop

Python Dictionary

dic = [v.__getstate__()[0] for v in values]
%timeit -n 1000 [v.__getstate__()[0] for v in values]
1000 loops, best of 3: 473 µs per loop

%timeit -n 100000 dic[0]['date']
100000 loops, best of 3: 59.9 ns per loop

Pandas DataFrame

pan = pd.DataFrame.from_records([v.__getstate__()[0] for v in values], index='date')
%timeit -n 1000 pd.DataFrame.from_records([v.__getstate__()[0] for v in values], index='date')
1000 loops, best of 3: 1.79 ms per loop

%timeit -n 100000 pan.index[0]
100000 loops, best of 3: 8.11 µs per loop
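For what it's worth, the per-element access timings above understate the structured array's advantage once operations are vectorised: whole-column work avoids the Python-level loop entirely. A rough sketch with synthetic data (the dtype mirrors a subset of the one above; the values are made up):

```python
import numpy as np

dt = np.dtype([('date', '<M8[us]'), ('open', '<f8'), ('high', '<f8'),
               ('low', '<f8'), ('close', '<f8')])

# Synthetic hour of m1 bars, one row per minute
arr = np.zeros(60, dtype=dt)
arr['date'] = np.arange('2017-09-21T12:00', '2017-09-21T13:00',
                        dtype='datetime64[m]').astype('M8[us]')
arr['close'] = np.linspace(1.28, 1.29, 60)

# One vectorised pass over a whole column -- no per-element Python calls
mean_close = arr['close'].mean()
spread = arr['close'].max() - arr['close'].min()
```

With a list of dicts, the equivalent would be a comprehension over every row for each statistic.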

The Python list comprehension really slows down overall performance when creating/extracting the data coming from the C++ code. I don't know what the timing would be on the C++ side; it would be interesting to find out. Is this a big job to change?

Hope my rambling makes sense.

Regards

James

neka-nat commented 6 years ago

Hi James. Thank you for the results. I'd like to confirm your point: do you care about the time to create the array from the list, or the time to access variables in the dictionary? Looking at the results, array creation has a speed advantage of about 1.2x (378 µs vs 473 µs), but variable access appears slower (1.19 µs vs 59.9 ns). I can see that a NumPy array would make things more efficient, but the change would mean converting the Price struct to a NumPy array in the C++ source, so I think it would be a fairly big change.

Regards, neka-nat

JamesKBowler commented 6 years ago

Thanks neka-nat. If the syntax for variable access is changed it improves, although it is still slower than a dict.

%timeit -n 100000 arr['date'][0]
100000 loops, best of 3: 272 ns per loop

Accessing one or two values is not really what I plan to do; I am trying to do most of the data cleaning via NumPy instead of Pandas. Loads of testing to do before I attempt the C++ changes 👍
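The cleaning I have in mind would be column-wise on the structured array, something along these lines (a sketch with made-up bars; the field names are taken from the dtype earlier in the thread):

```python
import numpy as np

dt = np.dtype([('date', '<M8[us]'), ('bidopen', '<f8'), ('bidhigh', '<f8'),
               ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')])
arr = np.array([
    ('2017-09-21T12:00', 1.3570, 1.3575, 1.3568, 1.3572, 120),
    ('2017-09-21T12:01', 1.3572, 1.3570, 1.3569, 1.3571, 0),   # bad bar: zero volume
    ('2017-09-21T12:02', 1.3571, 1.3576, 1.3570, 1.3574, 95),
], dtype=dt)

# Drop bars with zero volume or an inconsistent high/low range,
# all in vectorised boolean-mask operations
good = (arr['volume'] > 0) & (arr['bidhigh'] >= arr['bidlow'])
clean = arr[good]
```

The whole pass is a couple of vectorised comparisons rather than a loop over rows, which is where the structured array should pull ahead of the list of dicts.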

I will close this one off for now. Thanks again and I appreciate your response. James