mmozeiko / pkgi

pkg download & installation directly on Vita
The Unlicense

Using YAML as data format #18

Open xy2i opened 6 years ago

xy2i commented 6 years ago

If the pkgi.txt file is ever meant to be edited (which I assume it will be), then using YAML seems like a good choice. The main advantages of YAML are that it is much more readable, allows omitting entries, and doesn't add a lot of file size 'overhead'. Here is a simple comparison:

CSV format (original):

UP2089-PCSE00582_00-ADVENTURETIMEPAK,0,Adventure Time,,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg,371557824,a5d40400375659b619391128745d0aa419dea15149b276cc696577dc76b329ac
EP0082-PCSB00975_00-ADVENTURESOFMANA,0,Adventures of Mana,,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg,532326688,387845c6100dcf12be914220e246cbf5c227c12c79d686f8231fc3d166c85f0f
JP0082-PCSG00759_00-SEIKENFFGAIDENRM,0,Adventures of Mana,聖剣伝説FF外伝,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg,375289280,566642fbfe8b4c9ac2f5690b001e61b7bca609a3a5ef94b22b04e8d19c30e0c4

YAML:

- contentid: UP2089-PCSE00582_00-ADVENTURETIMEPAK
  flags: 0
  name: Adventure Time
  zrif: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  url: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg
  size: 371557824
  checksum: a5d40400375659b619391128745d0aa419dea15149b276cc696577dc76b329ac
- contentid: EP0082-PCSB00975_00-ADVENTURESOFMANA
  flags: 0
  name: Adventures of Mana
  zrif: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  url: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg
  size: 532326688
  checksum: 387845c6100dcf12be914220e246cbf5c227c12c79d686f8231fc3d166c85f0f
- contentid: JP0082-PCSG00759_00-SEIKENFFGAIDENRM
  flags: 0
  name: Adventures of Mana
  name2: 聖剣伝説FF外伝
  zrif: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  url: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg
  size: 375289280
  checksum: 566642fbfe8b4c9ac2f5690b001e61b7bca609a3a5ef94b22b04e8d19c30e0c4

Note how, with YAML, an item doesn't need to be included if it's empty (here, name2). You could also drop the flags entry when it is 0 (blank). Taking advantage of these tricks, we end up with a file whose size is very close to that of the original CSV (using input with about 900 entries):

$ du -b pkgi.yaml pkgi.csvoriginal 
387951  pkgi.yaml
340393  pkgi.csvoriginal

which amounts to a ~14.0% file size increase from the conversion to YAML (equivalently, the CSV is ~12.3% smaller than the YAML).
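For reference, a conversion like this can be sketched in a few lines of Python. This is only a minimal sketch, not the script used for the numbers above: the column order is read off the CSV sample, the field names are the ones from the YAML example, and `csv_to_yaml` and `drop_zero_flags` are made-up names. Values are written unquoted, so a real converter would still have to quote strings that YAML treats specially.

```python
import csv
import io

# Column order taken from the CSV sample above; the field names mirror
# the YAML example and are otherwise an assumption.
FIELDS = ["contentid", "flags", "name", "name2", "zrif", "url", "size", "checksum"]


def csv_to_yaml(csv_text, drop_zero_flags=False):
    """Convert pkgi-style CSV rows to a YAML list, skipping empty fields."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore blank lines
        first = True
        for key, value in zip(FIELDS, row):
            if value == "":
                continue  # empty fields are simply left out
            if drop_zero_flags and key == "flags" and value == "0":
                continue  # optionally treat flags: 0 as "blank"
            prefix = "- " if first else "  "
            # NOTE: values are emitted unquoted; this is only safe for
            # simple strings like the ones in the sample rows.
            lines.append(f"{prefix}{key}: {value}")
            first = False
    return "\n".join(lines) + "\n"
```

Calling `csv_to_yaml(text, drop_zero_flags=True)` applies both tricks mentioned above (no empty name2, no flags: 0).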

mmozeiko commented 6 years ago

What's wrong with editing CSV files? 1) load them into Excel/LibreOffice/Google Sheets, 2) edit, 3) export back to csv.

xy2i commented 6 years ago

> What's wrong with editing CSV files?

In European regions, Excel by default exports CSV files with semicolons rather than commas. This is because the comma is commonly used as the decimal separator in European numbers (e.g. 12,34 instead of 12.34).

In general, I think editing a file directly with whatever text editor is much more palatable than having to open a huge office suite or a slow web page every time you want to make small edits to your database. The workflow is the same with an editor, except it's much easier! You could argue that you can also edit the CSV with a simple text editor, but even right now the db is barely editable compared to the YAML version, which is much easier to read (and more flexible; I'll expand on this later), so I expect it to become completely unreadable the moment more entries are added. What if you are only on the Vita (no computer) and need to edit the db to fix a link?

To expand on this, let's compare CSV with YAML further:
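To make the semicolon problem concrete: anyone exporting from a spreadsheet in such a locale would need an extra clean-up step along these lines before the file is usable. This is a rough sketch; `normalize_delimiter` is a made-up helper, and guessing the delimiter from the first line is a crude heuristic, not something pkgi does.

```python
import csv
import io


def normalize_delimiter(text):
    """Re-export semicolon-separated CSV as comma-separated CSV."""
    first_line = text.split("\n", 1)[0]
    # Crude guess: whichever candidate appears more often in the first line.
    delimiter = ";" if first_line.count(";") > first_line.count(",") else ","
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    # The writer quotes any field that itself contains a comma.
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Note that after the round trip a title containing a comma comes back quoted, so whatever reads the file afterwards has to understand CSV quoting.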

Parsability

The only advantage I can attribute to CSV here is that it is smaller than YAML, by nature of its simplistic format. But this simplicity also brings some flaws, in particular with the kinds of strings you have to deal with on the Vita (game names). Also, even the current parsing has bugs: #19

YAML is standardized, so parsers exist for quite a lot of languages. It also makes parsing easier should the database ever "evolve" beyond pkgi itself; for example, a web site that automatically stores and updates the database.
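A small illustration of the kind of string problem meant above (delimiter collision). The row is made up, and this says nothing about how pkgi's own parser behaves or what #19 is about; it only shows why a naive split on commas is not enough once a title contains one.

```python
import csv
import io

# Hypothetical row: the title contains a comma, so it must be quoted in CSV.
line = 'XX0000-XXXX00000_00-EXAMPLE000000000,0,"Game, The",,zrif,http://example.invalid/a.pkg,1,00'

naive = line.split(",")                       # also splits inside the quoted title
proper = next(csv.reader(io.StringIO(line)))  # honours the quotes

print(len(naive), len(proper))  # 9 8
print(proper[2])                # Game, The
```

With keyed formats such as YAML or JSON, each value is tied to its key, so a stray delimiter cannot shift the following columns.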

Design

The way CSV is structured lends itself to several design flaws. I can point out a simple one here: the "name2" item (name_org, a.k.a. original name in the code) is flawed. The "name" entry holds the original name of the game when it stands alone; but if the game name is not alphabetic (A-Z, a-z), it is "translated" to an alphabetic name in "name", and "name2" then holds the original name. What if, instead of this dual function, "name" simply always contained the original name, and "name2" only contained the alternative name? Then "name" would always be a consistent entry, and "name2" could be used when it exists. But this cannot be done reliably with the current CSV implementation, because it would break the parsing.

Another example: how do you handle games that are the same (same name, etc.) but have different regions? How do you even find the region of a game? CSV has no answer for this, so the solution in the code is to parse the titleid to figure out the region.
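Roughly, that workaround looks like the following. This is a Python sketch for illustration, not pkgi's actual code; `region_from_contentid` is a made-up name. The PCSG/JPN and PCSB/EUR pairings match the sample rows in this thread, while PCSE/USA is an assumption.

```python
# Title ID prefix -> region. PCSG/JPN and PCSB/EUR match the sample rows in
# this thread; the PCSE/USA pairing is an assumption for illustration.
REGION_BY_PREFIX = {"PCSG": "JPN", "PCSB": "EUR", "PCSE": "USA"}


def region_from_contentid(contentid):
    """Guess the region from a content ID such as 'EP0082-PCSB00975_00-...'."""
    parts = contentid.split("-")
    if len(parts) < 2:
        return "unknown"  # not in the expected PUBLISHER-TITLEID-LABEL shape
    titleid = parts[1]    # e.g. 'PCSB00975_00'
    return REGION_BY_PREFIX.get(titleid[:4], "unknown")
```

The region is recovered from a naming convention rather than stored as data, which is exactly the kind of implicit knowledge a keyed format could make explicit.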

What if I add some new item, but decide to remove it later because the functionality became obsolete? Good luck with CSV: now I have one extra comma to add to every line, and each entry requires more mental gymnastics as manual editing becomes even harder and the database gets bigger.

What about YAML?

So, why did I suggest using YAML instead of CSV? Simply because it solves all these problems while keeping a good size (I think with some actual effort it can be made even smaller than my lazy conversion).

Let's tackle the problem I mentioned earlier about regions. As mentioned in this discussion, most of the work resides on the database side. Take my example database here; let's make something that looks much better and is easier to parse as well! (Note that even if you keep the same layout as the CSV, i.e. one entry per item, you can still use node anchors and references to avoid repeating data, which leads to a smaller size overall.)

- contentid: 
    JPN: JP0082-PCSG00759_00-SEIKENFFGAIDENRM
    EUR: EP0082-PCSB00975_00-ADVENTURESOFMANA
  name: Adventures of Mana
  name2: 聖剣伝説FF外伝
  zrif: 
    JPN: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    EUR: bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
  url: 
    JPN: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/japan.pkg
    EUR: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/europe.pkg
  size: 
    JPN: 375289280
    EUR: 532326688
  checksum: 
    JPN: 566642fbfe8b4c9ac2f5690b001e61b7bca609a3a5ef94b22b04e8d19c30e0c4
    EUR: 387845c6100dcf12be914220e246cbf5c227c12c79d686f8231fc3d166c85f0f
  author: Square Enix

In fact, here I can add new items easily, and I can omit optional items if I want (such as, say, game description, or author, or all kinds of lesser data). With this design, data can be differentiated for multiple regions, or just given as a single entry if needed.
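On the reading side, "per region or single entry" boils down to one small lookup. A sketch, assuming the YAML has already been parsed into plain dicts by whatever library is used; `resolve` is a made-up helper name.

```python
def resolve(entry, key, region):
    """Return entry[key] for a region; plain values apply to every region."""
    value = entry.get(key)  # None if the optional item was omitted
    if isinstance(value, dict):
        return value.get(region)  # None if this region is not listed
    return value


# The parsed form of (part of) the entry above, written out as a dict.
entry = {
    "contentid": {"JPN": "JP0082-PCSG00759_00-SEIKENFFGAIDENRM",
                  "EUR": "EP0082-PCSB00975_00-ADVENTURESOFMANA"},
    "name": "Adventures of Mana",
    "size": {"JPN": 375289280, "EUR": 532326688},
}
```

Here `resolve(entry, "size", "EUR")` gives the EUR size, `resolve(entry, "name", "JPN")` gives the shared name, and a missing item such as `"author"` simply comes back as `None`.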

fennectech commented 6 years ago

Why not use a tool like csved? Though I must say YAML makes more sense for DLC.

http://csved.sjfrancke.nl

As for your worries about Linux users: it runs great in Wine! It's a snappy tool that is freeware (not FOSS, though, if that is a concern). On something like a Vita, this 12 percent difference in size is significant.

We could implement off-the-shelf gzip compression of the YAML file to get our 12 percent back, though!

Simply put the .yml file into a gzip container and then decompress it on demand. We could also easily use TheFloW's rar implementation (or zip, but it's less efficient).

https://github.com/TheOfficialFloW/VitaShell/tree/master/unrarlib
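The round trip itself is tiny. Here is a desktop-side sketch in Python using the standard gzip module; `pack` and `unpack` are made-up names, and on the Vita the decompression half would of course have to be done in pkgi's own code.

```python
import gzip


def pack(yaml_text):
    """Compress the database text; returns gzip-framed bytes."""
    return gzip.compress(yaml_text.encode("utf-8"))


def unpack(blob):
    """Decompress on demand and decode back to text."""
    return gzip.decompress(blob).decode("utf-8")
```

How much is actually saved depends on the data, so it is worth measuring on the real file before counting on it.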

mmozeiko commented 6 years ago

Ok, and why yaml, and not toml, json, xml, or just go full sqlite3?

Joining games under the same name probably won't happen, though.

fennectech commented 6 years ago

YAML is braindead simple to edit and has a sane layout that an actual human can easily edit. Basically, it looks pretty. That's one reason to use YAML. And sqlite3 is a huge pain: it requires quite a bit more work to implement unless you use an existing library.

xy2i commented 6 years ago

Why not..

SQLite

As stated above, sqlite is a pain. Also, it misses the point of 'plaintext format' and 'easy to edit' - currently, the database has the useful property of being very easily shareable over a plaintext medium.

XML

XML has quite a lot of overhead. In general, XML is more of a markup language, which doesn't fit the use case here, while YAML is a data format.

TOML

From the page itself:

Be warned, this spec is still changing a lot. Until it's marked as 1.0, you should assume that it is unstable and act accordingly.

JSON

This is the other serious contender. JSON is very easy to parse, fits the needs (readability, plain text, resistance to delimiter collision), has been around for a long time, and takes about the same size as YAML. I chose YAML here because of its easier readability, but it comes down to:

So, after looking into it more, it does seem that JSON may be better suited; if my conclusions are wrong, please rebut them.
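For comparison, here is what writing the same kind of entries as JSON could look like with Python's standard json module. Again only a sketch: `to_json` is a made-up name and the field names are the ones from the YAML example.

```python
import json


def to_json(entries):
    """Serialize entries as JSON, leaving out empty fields."""
    cleaned = [
        {k: v for k, v in e.items() if v not in ("", None)}
        for e in entries
    ]
    # ensure_ascii=False keeps titles such as 聖剣伝説FF外伝 readable in the file.
    return json.dumps(cleaned, ensure_ascii=False, indent=2)
```

Empty items can be omitted just as in the YAML version, and a flags value of 0 is kept because it is not an empty string.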