portfolio-performance / portfolio

Track and evaluate the performance of your investment portfolio across stocks, cryptocurrencies, and other assets.
http://www.portfolio-performance.info
Eclipse Public License 1.0
2.86k stars 588 forks source link

Towards more interoperable native XML format, was: Why XStream.XPATH_RELATIVE_REFERENCES was used to use save Portfolio Performance XML? #3417

Open pfalcon opened 1 year ago

pfalcon commented 1 year ago

Problem: with this mode, object references get recorded as XPath expressions relying on particular index an object has in an XML file, e.g.:

          <amount>102188</amount>
          <security reference="../../../../../securities/security[30]"/>
          <crossEntry class="buysell">

This is unfriendly to both users looking into the file and tools intending to read, and especially write PP XML files. For example, for user, it's almost impossible to tell which security gets actually referenced in the snippet above (would take searching thru the XML to see which <security> tag appears 30th in the file, who in their right mind would do that?). Of course, it would be almost impossible to add a new info to the file: adding something at the wrong place would wreak havoc of the file structure; even if a user is careful to add something at the end of a corresponding section, it's almost impossible to cross-reference it around.

Even for automated tools, it takes extra processing to resolve such references. Doing a quick survey of such tools, I see that some actually try to resolve such refs as XPath expressions, but other do some adhoc processing (e.g. look for "securities/security[N]" patterns and then look for Nth security in already collected list).

This ticket is intended to make this problem explicit, and collect evidence that it's, of course, a known problem (all that to hopefully give interested parties hints how to go about it).

pfalcon commented 1 year ago

All the issues above could have been rectified by using XStream.ID_REFERENCES instead, which uses XML as, well, the XML. XStream documentation is very explicit about it: https://x-stream.github.io/graphs.html#ids :

Both modes displayed until now are not so easy to write by hand. XStream has another mode which makes it is easier to read/write by a human being: xstream.setMode(XStream.ID_REFERENCES);

pfalcon commented 1 year ago

Some evidence that this is a know problem:

In discussion of https://github.com/buchen/portfolio/issues/2363 :

https://github.com/buchen/portfolio/issues/2363#issuecomment-897654963

The use of XPath in the XML to refer to securities makes it hard to read in.

https://github.com/buchen/portfolio/issues/2363#issuecomment-903223593 :

Response from @buchen (PP author):

Agree. With protobuf, it is all UUIDs now.

So again, the problem is known, understood and rectified in ProtoBuf saving format.

pfalcon commented 1 year ago

Given the question in the ticket title is likely rhetoric, I'll try to speculate on it myself: the only explanation why PP uses XStream.XPATH_RELATIVE_REFERENCES is that it's default XStream's format, period. And the likely resulting XML was never validated against some set of requirements. I.e. XStream library wasn't chosen because it allows to store PP information in easy to use and interoperate with XML format, it was chosen just as a Java persistence library, which just by a chance uses XML format.

All this could have been a bit different if there was a specific requirement to produce XML with which other tools could easily interact (and that's why tickets like https://github.com/buchen/portfolio/issues/3303 are shyly being posted, suggesting that it might be possible to think ahead, or at least again, of the choices made).

pfalcon commented 1 year ago

Given the above hypothesis (that default XStream option was chosen ~ randomly), some ideas can be drafted what to do about it:

  1. PP already has an option to export currently open file as XML (useful for cases when open file is stored in ProtoBuf format). Wouldn't be to hard to add a choice to export it in "XML with id attributes" format, for easier consumption by 3rd-party tools.
  2. Actually, if the issues with the XPATH_RELATIVE_REFERENCES format is acknowledged (and it was acknowledged by the author), them migrating to XStream.ID_REFERENCES wouldn't be impossible. Just bump version number and save as ID_REFERENCES. But my understanding that the XStream library will read only the format which is configured (won't recognize any and load it anyway). Then, can do: try to load with the ID_REFERENCES (i.e. new) format initially. With the old format, there would be either error loading, or worst-case, it loads, but with references not existing. Still would be enough to check format version and reload with XPATH_RELATIVE_REFERENCES (will be converted automagically on the next save).
pfalcon commented 1 year ago

Another funky artifacts of XStream XML:

  <accounts>
    <account>
      <uuid>d18cef06-ebb6-4e14-8cad-fe7410d2e6b6</uuid>
      <name>Konto 1</name>
      <currencyCode>EUR</currencyCode>
      <isRetired>false</isRetired>
      <transactions>
        <account-transaction>
          <uuid>6bfff0a7-f0e9-4523-9deb-70e4ccc71ab3</uuid>
          <date>2018-01-01T00:00</date>
          <currencyCode>EUR</currencyCode>
          <amount>3500000</amount>
          <shares>0</shares>
          <updatedAt>2023-06-19T19:08:53.180664775Z</updatedAt>
          <type>DEPOSIT</type>
        </account-transaction>
        <account-transaction>
          <uuid>d471dac0-701e-4655-a1d0-18057748e8be</uuid>
          <date>2018-01-02T00:00</date>
          <currencyCode>EUR</currencyCode>
          <amount>102188</amount>
          <security reference="../../../../../securities/security[30]"/>
          <crossEntry class="buysell">
            <portfolio>
              <uuid>c8676f98-3337-4c64-a6c2-0974788790bb</uuid>
              <name>Depot 1</name>
...
  <portfolios>
    <portfolio reference="../../accounts/account/transactions/account-transaction[2]/crossEntry/portfolio"/>
  </portfolios>

I.e. an object (portfolio here) is output in full on a first reference. Following mentions use the reference to that first full repr. Totally understood and correct from a dumb algorithm point of view, but clearly not what a human (or a script written by a human) wants to see. (Expected: all portfolios are defined in <portfolios>, and mentions in xacts, etc. use references.)

buchen commented 1 year ago

@pfalcon

the only explanation why PP uses XStream.XPATH_RELATIVE_REFERENCES is that it's default XStream's format, period.

I agree that the format in a way just happened 11 years ago

I would be willing to change this, if the program can continue to read both formats (with relative references and id references) and it has no significant impact on reading performance and file size.

At a quick glance, one could use

 xstream.setMode(XStream.XPATH_ABSOLUTE_REFERENCES);

which seems to be able to read the existing files with relative references, but writes them with absolute references. However, that does not help that the portfolio structure are written at first occurrence. I think this approach does not provide any benefits.

So that would mean implementing a fallback such as outline above

Actually, if the issues with the XPATH_RELATIVE_REFERENCES format is acknowledged (and it was acknowledged by the author), them migrating to XStream.ID_REFERENCES wouldn't be impossible. Just bump version number and save as ID_REFERENCES. But my understanding that the XStream library will read only the format which is configured (won't recognize any and load it anyway). Then, can do: try to load with the ID_REFERENCES (i.e. new) format initially. With the old format, there would be either error loading, or worst-case, it loads, but with references not existing. Still would be enough to check format version and reload with XPATH_RELATIVE_REFERENCES (will be converted automagically on the next save).

I think that would work. The only issue I see that the bulk of the data (security + prices) is read before the first relative reference is read. Hence the reading fails pretty late. On the other hand, it typically will happen only once.

I am happy to entertain a pull request if you want to make a (draft) proposal

Generally, I want to move to more use of the protobuf format. It is much faster to read and write, uses less memory, and can be used in other languages as well.

buchen commented 1 year ago

From your comment in the other thread, I also understand that there are already a number of tools that work with the existing relative references. Of course, such an incompatible change would also break the other tools.

pfalcon commented 1 year ago

Thanks for the reply!

I would be willing to change this, if the program can continue to read both formats (with relative references and id references) and it has no significant impact on reading performance and file size.W

Well, so far I just tried to change File -> Export -> Portfolio Performance XML to use ID_REFERENCES format. One thing to immediately notice is that the file is bigger (for the dax.xml: 1091104 -> 1282670 bytes, 17.5% increase). That's because XStream outputs the "id" attribute for every object output (regradless if it's ever referenced or not). And given objects like <price>, that adds up quickly (even though the references themselves are shorter, reference="17110" vs reference="../../../../../securities/security[30]").

Intuitively, using id's would be also slower, because certainly array access is faster that using a map. But of course, I didn't time it, and realistically, considerations like above may be dwarfed by XML parsing and XPath handling.

xstream.setMode(XStream.XPATH_ABSOLUTE_REFERENCES);

I don't think that would address concerns mentioned in the original description, again, the problem is that "here we refer to security with relative position 15" isn't much human-friendly, and relative vs absolute XPATH refs don't change it. (Whereas with ID_REFERENCES, "security with id 123" is just a simple whole-doc search to find out what that security is.)

I am happy to entertain a pull request if you want to make a (draft) proposal

Thanks for being positive about it. Well, I'm trying to think/plan ahead first (btw, I did search here Github project and forum for XPATH_RELATIVE_REFERENCES / ID_REFERENCES and didn't find anything, why creating this ticket). I otherwise have very little time for PP so far :-(, the more so need to plan what would be the best sequence of hacking to perform on it ;-).

However, that does not help that the portfolio structure are written at first occurrence.

Yes, addressing that (as far as I can tell) would require only deeper changes to XML saving, roughly saying, making it more relational tables like. I.e. output first accounts (their properties, but not xacts), then portfolios (ditto), and then separately transactions under a new top-level tag like <transactions>. Dunno even if XStream can do that, and again, if so, would require a few code changes.

From your comment in the other thread, I also understand that there are already a number of tools that work with the existing relative references. Of course, such an incompatible change would also break the other tools.

Yes. But the impression I got is that those tools/libs are also very rough and far from being complete. So, it mostly boils down whether there's a desire to make the XML to be interchange/interoperability format, not just internal storage format as it is now. If so, it would make sense to consider/plan a number of changes to make it more easily human/machine readable, then make the changes, then invite all the interested 3rd-parties to develop tools for now officially interoperable XML format.

Thanks again for the discussion, I'll keep that in mind, and medium-term (1-2 years), I guess it may be worth doing it.

pfalcon commented 1 year ago

Also noticed that there're <account-transaction> vs <accountTransaction> and <portfolio-transaction> vs <portfolioTransaction> %).

pfalcon commented 10 months ago

I generally try to search previous bugreports, but managed to miss that suggestion to drop XPATH_RELATIVE_REFERENCES was reported before: https://github.com/portfolio-performance/portfolio/issues/2195

pfalcon commented 2 months ago

Patchset to implement references using XML id attributes is posted: https://github.com/portfolio-performance/portfolio/pull/4117