Use Protocol Buffers to store data in a binary format

buchen commented 3 years ago

In the branch feature_protobuf I have drafted saving the data using Protocol Buffers. Users tend to have ever bigger files, and loading and saving them takes more and more time: the XML is relatively verbose, and converting it into Java objects is slow.

Therefore I did a prototype using protocol buffers. My thinking was that it is much better than rolling my own persistence, it has a clear story for backwards compatibility, and it provides bindings for many other languages, which might create new use cases in the future.

For now, I have only implemented the storing of securities and historical prices (as this takes up the bulk of the data), but the test is very promising. Take a look at the video. The file contains 720 securities and 1.3m historical prices.

https://user-images.githubusercontent.com/587976/127779337-a929c244-13fd-4eb4-ad59-91fcd363b83d.mp4

Has anybody experience with protocol buffers? I am wondering how to store some of the edge cases:

attributes can contain arbitrary Java objects - in practice those are mostly Strings, Doubles, Longs - how best to store these?
transactions reference securities - I could use the UUID, but I could also use an index into the list of securities - what are the pros and cons?
for the historical prices, PP stores only the date and the close (but not high, low, volume). For the latest security price, PP stores all of those pieces of information. I am considering storing high, low, volume for all historical prices in the future. How would I model the Proto file so that it can be extended? At the moment, I have two separate message definitions.

Any help is appreciated.
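
One possible shape for the attribute question, sketched in plain Java rather than taken from the branch: map each attribute value onto a small closed set of scalar kinds, which is roughly what a protobuf `oneof` over string, int64, double and bool fields would express. The class `AttributeValue` and its `Kind` enum are hypothetical names for illustration, not part of PP or of the protobuf API.

```java
// Hypothetical helper, not PP code: narrows arbitrary attribute objects
// to the scalar kinds a protobuf `oneof` could carry.
final class AttributeValue {
    enum Kind { STRING, INT64, DOUBLE, BOOL }

    final Kind kind;
    final Object value;

    private AttributeValue(Kind kind, Object value) {
        this.kind = kind;
        this.value = value;
    }

    // Converts a Java object into a tagged scalar, or fails loudly so that
    // unsupported types are noticed when saving instead of being lost.
    static AttributeValue fromObject(Object o) {
        if (o == null)
            throw new IllegalArgumentException("null attribute value");
        if (o instanceof String)
            return new AttributeValue(Kind.STRING, o);
        if (o instanceof Long)
            return new AttributeValue(Kind.INT64, o);
        if (o instanceof Integer)
            return new AttributeValue(Kind.INT64, ((Integer) o).longValue());
        if (o instanceof Double)
            return new AttributeValue(Kind.DOUBLE, o);
        if (o instanceof Boolean)
            return new AttributeValue(Kind.BOOL, o);
        throw new IllegalArgumentException("unsupported attribute type: " + o.getClass().getName());
    }
}
```

The trade-off in this sketch is that every new attribute type needs an explicit case, which is more work on the Java side but keeps the stored file readable from other languages.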

funnym0nk3y commented 3 years ago

Just out of curiosity: Why do you think using protobuf is advantageous compared to let's say SQLite with an ORM?

As far as I understand portfolio loads all the data to RAM when it is opened, right? Couldn't this be avoided and thus the startup time reduced if the data was loaded as needed?

chirlu commented 3 years ago

Generally speaking, what I like about XML is that I can view and edit it without special tools (just a text editor). E.g., I’ve already done search-and-replace operations on quote feed URLs when Ariva changed the structure. I understand protobuf does have a function to dump the contents of a file as text, and there may be some way to reverse this after making changes, but it is in any case a much more complex operation. On the other hand, I don’t care about a few seconds at startup. So I’m not sure I would see this as progress.

I am considering storing high, low, volume for all historical prices in the future.

Reminds me of some sayings to the effect of “software will always expand to eat up any speed/storage improvements provided by newer hardware”. ;-) What would be the use case for this additional data? Only thing I can see is more detailed charts, but chartists will still feel too limited by what PP can offer in that area.

lmb commented 3 years ago

I've attempted to read Portfolio.xml from Go to do some analysis, here are my 2c:

The use of XPath in the XML to refer to securities makes it hard to read in. I think this is what you mean by "arbitrary Java objects"? I would go with the most straightforward scalar protobuf types. Probably more work on the Java side.
Referencing securities: I would go with UUID. It's easy to decode securities into a map of UUID->Security and refer to that while decoding the protobuf. Indices will often overlap, so bugs in indexing will be subtle and hard to spot.
Since fields in proto3 are optional, you should already be able to store only (date, close) in a PLatestSecurityPrice. Depends a bit on what the generated Java implementation does.
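
The UUID approach described above can be sketched as a two-pass decode. This is a minimal illustration in plain Java under stated assumptions: `Security`, `StoredTransaction`, `Transaction` and the two helper methods are hypothetical stand-ins, not the classes generated from client.proto or PP's model classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not PP code: resolve UUID references after decoding.
final class UuidReferences {
    static final class Security {
        final String uuid;
        final String name;
        Security(String uuid, String name) { this.uuid = uuid; this.name = name; }
    }

    // What the file stores: the reference is just the security's UUID.
    static final class StoredTransaction {
        final String securityUuid;
        final long shares;
        StoredTransaction(String securityUuid, long shares) {
            this.securityUuid = securityUuid;
            this.shares = shares;
        }
    }

    // What the application works with: a direct object reference.
    static final class Transaction {
        final Security security;
        final long shares;
        Transaction(Security security, long shares) { this.security = security; this.shares = shares; }
    }

    // Pass 1: index all securities by UUID, rejecting duplicates.
    static Map<String, Security> indexByUuid(List<Security> securities) {
        Map<String, Security> byUuid = new HashMap<>();
        for (Security s : securities) {
            if (byUuid.put(s.uuid, s) != null)
                throw new IllegalStateException("duplicate security UUID: " + s.uuid);
        }
        return byUuid;
    }

    // Pass 2: replace each stored UUID with the matching object.
    static List<Transaction> resolve(List<StoredTransaction> stored, Map<String, Security> byUuid) {
        List<Transaction> result = new ArrayList<>();
        for (StoredTransaction t : stored) {
            Security s = byUuid.get(t.securityUuid);
            if (s == null)
                throw new IllegalStateException("unknown security UUID: " + t.securityUuid);
            result.add(new Transaction(s, t.shares));
        }
        return result;
    }
}
```

A dangling UUID fails immediately with a clear message here, whereas a wrong list index would usually still point at some valid security, which is the subtle kind of bug mentioned above.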

buchen commented 3 years ago

@funnym0nk3y writes:

As far as I understand portfolio loads all the data to RAM when it is opened, right? Couldn't this be avoided and thus the startup time reduced if the data was loaded as needed?

Good question. Reading from protobuf is just a small incremental change that impacts only the writing and reading of data at the start and end of the program. Querying a database impacts a significantly larger part of the code base. At the moment, it does not feel like my limited time is spent well on such a huge refactoring.

BTW, @tfabritius is working on syncing the PP data into a database. Currently, it is a one-way sync: from the file into the database via a Portfolio Report API. For the client code, have a look at the Java package n.a.portfolio.online.portfolioreport.

@chirlu

Generally speaking, I like about XML that I can view and edit it without special tools (just text editor).

That is the reason why I do not plan to get rid of the XML. It should always be possible to save a file in XML format and read it from XML. My motivation is: users are only willing to input so much data if they can fully extract the data later on. CSV and similar exports can provide part of the data, yes, but the XML is the full picture.

@lmb writes:

The use of XPath in the XML to refer to securities makes it hard to read in.

Agree. With protobuf, it is all UUIDs now.

Referencing securities: I would go with UUID. It's easy to decode securities into a map of UUID->Security and refer to that while decoding the protobuf. Indices will often overlap, so bugs in indexing will be subtle and hard to spot.

This is how I do it at the moment - see client.proto.

tquellenberg commented 2 years ago

For me, saving the protobuf file in combination with "COMPRESSED" is very slow (4 seconds for 9 MB) and saves very little space (< 2 MB). Maybe you could change this setting: (name.abuchen.portfolio.ui.handlers.SaveAsFileHandler.execute(MPart, Shell, String))
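
To check whether compression pays off for a given payload, a small measurement helper can be useful. The sketch below uses the JDK's DEFLATE implementation from `java.util.zip`; it is an assumption that this resembles what the "COMPRESSED" option does, since the thread does not show PP's actual implementation. `CompressionCheck` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper, not PP code: measures how much DEFLATE shrinks a payload.
final class CompressionCheck {
    // Compresses the input in memory at the given level
    // (e.g. Deflater.BEST_SPEED or Deflater.BEST_COMPRESSION).
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer, deflater)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end(); // release native resources of the custom Deflater
        }
        return buffer.toByteArray();
    }

    // Compressed size divided by original size; values near 1.0 mean
    // compression costs time without saving space. Expects non-empty input.
    static double ratio(byte[] input, int level) {
        return (double) deflate(input, level).length / input.length;
    }
}
```

Comparing `ratio(data, Deflater.BEST_SPEED)` with `ratio(data, Deflater.BEST_COMPRESSION)` on a real file, together with the elapsed time of each call, would show whether a faster level gives up much space for this kind of data.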

pfalcon commented 1 year ago

As far as I understand portfolio loads all the data to RAM when it is opened, right? Couldn't this be avoided and thus the startup time reduced if the data was loaded as needed?

Good question. Reading from protobuf is just a small incremental change that impacts only the writing and reading of data at the start and end of the program. Querying a database impacts a significantly larger part of the code base. At the moment, it does not feel like my limited time is spent well on such a huge refactoring.

Note that an initial database implementation doesn't have to be "querying"; it could be "bulk-load" and "bulk-save", just like any other existing storage backend.
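
As a rough illustration of what "bulk-load" and "bulk-save" could mean, here is a minimal Java sketch of a backend contract that only ever reads or writes the whole data set. All names (`BulkStorage`, `ClientStore`, `InMemoryStore`, `convert`) are hypothetical; the in-memory store merely stands in for an XML, protobuf or database-backed implementation and is not how PP structures its persistence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not PP code: storage backends without querying.
final class BulkStorage {
    // A backend only needs to load everything and save everything.
    interface ClientStore<T> {
        T load();
        void save(T data);
    }

    // Stand-in backend; a file or database implementation would have the same shape.
    static final class InMemoryStore implements ClientStore<List<String>> {
        private List<String> records = new ArrayList<>();

        @Override
        public List<String> load() {
            return new ArrayList<>(records); // defensive copy
        }

        @Override
        public void save(List<String> data) {
            records = new ArrayList<>(data);
        }
    }

    // Converting between formats is then just load from one, save to the other.
    static <T> void convert(ClientStore<T> from, ClientStore<T> to) {
        to.save(from.load());
    }
}
```

With such a contract the rest of the application keeps working on in-memory objects, so the large refactoring mentioned above would only become necessary once data is actually loaded on demand.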

Then, the XStream library, as used for saving XML, itself supports a binary mode, which is supposedly also more efficient in terms of space and processing than XML: https://x-stream.github.io/manual-tweaking-output.html#Configuration_Format

Then, between 2 obvious (?) choices:

A low-hanging solution: literally change a few lines and use XStream's binary format.
Using an SQL database, as the choice offering the greatest interoperability.

- a third one was made:
Investing considerable effort to support one company's proprietary binary format.

It is certainly as valid a choice as any other. It could actually have been a "golden mean": the format is not as proprietary as XStream's binary one, and it might have taken less effort than integrating a DB library. Or maybe XStream's binary mode was a relative novelty that was not available at the time. IMHO, it's still useful to have such a decision-making "postmortem" (it might even help with future decision-making).

portfolio-performance / portfolio

Use Protocol Buffers to store data in a binary format #2363