inblockio / mediawiki-extensions-Aqua

This MediaWiki extension includes the Aqua implementation, turning MediaWiki into a powerful versioned notary service with APIs for import and export of data in JSON format.
GNU General Public License v3.0

Include media for file export and import and include media hash in the verification procedure #14

Closed FantasticoFox closed 2 years ago

it-spiderman commented 2 years ago

Hashing media and ensuring integrity of the page that transcludes other entities

Hashing files

To enable this, we would need to add another table, file_verification, which would hold the hash values of the files themselves. Once this is set up, we would hash the actual file content and include that hash in the content hash for the file page.

This additional table is only needed to support files that are included directly. If only file pages are included, it is not necessary, as in that case the file hash would already be baked into the page hash.
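
A minimal sketch of the file-hashing idea above. SHA3-512 and the helper names (file_hash, file_page_content_hash) are assumptions for illustration; the proposal does not say how the file hash is combined with the page text, so the simple concatenation here is only one possible choice.

```python
import hashlib


def file_hash(file_bytes: bytes) -> str:
    """Hash the raw file content (the value a file_verification row would hold)."""
    return hashlib.sha3_512(file_bytes).hexdigest()


def file_page_content_hash(page_text: str, file_bytes: bytes) -> str:
    """Content hash of a file page: the page text plus the hash of the file itself."""
    hasher = hashlib.sha3_512()
    hasher.update(page_text.encode("utf-8"))
    hasher.update(file_hash(file_bytes).encode("ascii"))
    return hasher.hexdigest()


original = file_page_content_hash("A description", b"original file bytes")
replaced = file_page_content_hash("A description", b"replaced file bytes")
# The file page hash changes even though the page text is identical.
print(original != replaced)
```

With this shape, a page that includes only the file page picks up the file hash indirectly, which is why the extra table matters only for directly included files.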

Integrity of the pages

If we include a file on a page (or transclude another page, for that matter), we need to make sure that the page hash includes references to the versions of the included entities (files and pages). If PageA includes File:MyFile.png, the hash for PageA would be hash( PageA->getContent(), getHash( File:MyFile.png ) ). Actually, with the solution below, this would be done automatically.

Storing included hashes

To keep track of the included versions of pages/files, we need a hash map. PageA from the previous example would have:

{
    "File:MyFile.png": "123...28x9" // Hash for File:MyFile.png at time of inclusion
}

To store this, we propose using another slot on the same revision (the main slot holds the content, the inclusion-hashes slot holds the hashes as JSON). The big challenge here: we need to save both slots at the same time (in the same revision). This has never been done before. This hash map would be included in the content hash of the page.
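
A rough sketch of how a content hash could cover both slots. The slot names main and inclusion-hashes come from the proposal; the function names, the sorted-key JSON serialization, the NUL separators, and SHA3-512 are all assumptions made for this illustration, not the extension's actual scheme.

```python
import hashlib
import json


def serialize_inclusion_hashes(inclusions: dict[str, str]) -> str:
    """Serialize the hash map deterministically so that equal maps hash equally."""
    return json.dumps(inclusions, sort_keys=True, separators=(",", ":"))


def revision_content_hash(slots: dict[str, str]) -> str:
    """Hash every slot of a revision, visiting slots in sorted name order."""
    hasher = hashlib.sha3_512()
    for slot_name in sorted(slots):
        hasher.update(slot_name.encode("utf-8"))
        hasher.update(b"\x00")  # separator keeps name/content boundaries unambiguous
        hasher.update(slots[slot_name].encode("utf-8"))
        hasher.update(b"\x00")
    return hasher.hexdigest()


slots = {
    "main": "Some text with [[File:MyFile.png]]",
    # Placeholder hash value taken from the example above.
    "inclusion-hashes": serialize_inclusion_hashes({"File:MyFile.png": "123...28x9"}),
}
print(revision_content_hash(slots))
```

Because the inclusion-hashes slot is part of the hashed input, changing a pinned file hash changes the page's content hash without touching the main slot.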

Updating included entities

What happens when a file/page that is included somewhere else changes? If we use the previous example and upload a new version of File:MyFile.png, the hash of that page of course changes. If we go to PageA, we would still see the old version of the file (the version that matches the stored hash), but we would also see a section saying that File:MyFile.png has a newer version. A diff link would be presented. If the user decides to update to the new version of the included file, a new revision of PageA would be created, and the hash stored in inclusion-hashes would be updated. The new version of the file would now be visible on PageA.
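
The update flow described above can be sketched as follows. Both function names and the plain-dictionary representation of the stored and current hashes are hypothetical; in the extension this would involve creating a new revision rather than returning a new dictionary.

```python
def find_outdated_inclusions(stored: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return titles whose stored hash no longer matches the current hash.

    A title missing from `current` (e.g. a deleted resource) is also reported.
    """
    return sorted(title for title, pinned in stored.items() if current.get(title) != pinned)


def update_inclusion(stored: dict[str, str], title: str, current: dict[str, str]) -> dict[str, str]:
    """Return the hash map for a new page revision with one inclusion updated."""
    updated = dict(stored)  # the old revision's map stays untouched
    updated[title] = current[title]
    return updated


stored = {"File:MyFile.png": "old-hash", "Template:Box": "box-hash"}
current = {"File:MyFile.png": "new-hash", "Template:Box": "box-hash"}

for title in find_outdated_inclusions(stored, current):
    print(f"{title} has a newer version")

new_revision_map = update_inclusion(stored, "File:MyFile.png", current)
```

Until the user confirms the update, the page keeps rendering the version matching the stored hash, so the old map is never modified in place.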

Revision slots explanation

In later MW versions, a new implementation called Multi-Content-Revisions is available. This allows one page (and one revision) to have multiple separate slots that hold data. This is perfect for this use case, as all hash metadata, including the signature, can be kept separate from the main content (revision slot main). This way, we don't pollute the actual content of the page with metadata. There are other advantages as well:

Although this technology is quite new and unknown, and would take some investigation to get running properly, we feel it's the best option going forward, and it is required to do most of the things described here.

Notes

FantasticoFox commented 2 years ago

I'll take the time to review this at length tomorrow. The main question for me right now is how we integrate with our external-verifier, which queries the content via the API and recalculates the hashes to compare them (e.g. the command-line tool or the Chrome extension).

FantasticoFox commented 2 years ago

I read your article multiple times; this comment might be obsolete. We just need to ensure that a user can't delete resources embedded in pages, as that breaks the history verification. Therefore the user needs to delete the pages before he can delete the resource. Another solution could be to rebase the page and delete the history to resolve it for the user, but that should be an action with a warning and requires a new workflow.

Writing in the imperative from a protocol perspective: it is PROHIBITED to upload a file and take the same SLOT, because this alters the history and we can't account for that. Any NEW uploaded file needs to generate a NEW revision / name used as a unique resource to represent the new version of that file. That could be done by just iterating a version number. E.g. new version: .

The current behavior of MediaWiki seems inconsistent with its commitment to trackable page revisions. They have not followed up to generalize this behavior for embedded files. We need to do that to 'fix' the broken behavior. The expected behavior for e.g. a Word document being edited and stored in MW is a revisioned file page which holds all the old versions of the document, all of them revision controlled.

If files are removed, the revision verification will and NEEDS TO fail, as we NEED to account for the history of the page. This is expected and desired behavior. If a user would like to 'prune' his history, he needs to create a new page and delete the old one, which solves his 'need' to start clean.

FantasticoFox commented 2 years ago

On transclusion: we have a huge opportunity to solve this in a general manner, also for sub-pages. This is what we urgently need to ensure correct behavior of MediaWiki, BUT also to prototype future functionality we desire.

The transclusion behavior should be uniform between embedded file pages and sub-pages. This will reduce our work on the external-verifier side tremendously. We need to be able to query and verify all those transcluded resources.

FantasticoFox commented 2 years ago

New content hash = content hash + transcluded resource 1 + transcluded resource 2 (but then they need to be ordered by appearance with the magic word). Then we query and verify those resources as parallel requests.
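
One way to read the formula above, as a sketch only: the transcluded hashes are appended in order of appearance, and the resources are then checked concurrently. The function names, the use of SHA3-512 over the concatenation, and the verify_one callback (a stand-in for whatever API request the external verifier would make) are all assumptions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def combined_content_hash(content_hash: str, transcluded_hashes: list[str]) -> str:
    """Hash the page's content hash followed by each transcluded hash, in order of appearance."""
    hasher = hashlib.sha3_512()
    hasher.update(content_hash.encode("ascii"))
    for resource_hash in transcluded_hashes:
        hasher.update(resource_hash.encode("ascii"))
    return hasher.hexdigest()


def verify_resources(titles: list[str], verify_one: Callable[[str], bool]) -> dict[str, bool]:
    """Run one verification per transcluded resource in parallel threads."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(verify_one, titles))  # map() preserves input order
    return dict(zip(titles, results))


# Swapping the order of appearance yields a different combined hash.
first = combined_content_hash("aa", ["11", "22"])
second = combined_content_hash("aa", ["22", "11"])
print(first != second)

# Stand-in verifier: a real one would fetch the resource and recompute its hash.
print(verify_resources(["File:MyFile.png", "PageB"], lambda title: title.startswith("File:")))
```

The order sensitivity is the reason the transcluded resources need a stable ordering before hashing.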

FantasticoFox commented 2 years ago

This is a known limitation for v1.0 and will be added with v1.1

it-spiderman commented 2 years ago

Progress report: Made good progress today. Managed to find a way to store multiple revision slots in one edit. With that done, we are now able to store the hashes of all included files/pages. Started working on a mechanism to include a version based on its hash.

FantasticoFox commented 2 years ago

Well done, thanks!


-- Tim Bansemer CEO / Geschäftsführer

inblock.io assets GmbH Humboldtstrasse 13 99423 Weimar, Germany Amtsgericht Jena HRB 514776 VAT-ID: DE319663515

it-spiderman commented 2 years ago

Progress report: By now, we have a mechanism to "freeze" a resource in the state it was in when it was verified. Working on alerts about new versions of included resources and hash tampering.

FantasticoFox commented 2 years ago

Alerts about new versions and a way to 'update the resource to the current one' is very desirable.

Hash tampering is detected by the external verifier. This requires the ability to query content (picture, file etc.) from the wiki via an API (like we query the wiki text via API). I don't think it's worthwhile to try to implement that inside the MWAccounting extension itself. In short, I believe it's a waste of time and we should not do it.

Please shoot over any questions. Maybe we need more clarification on the verification part?

If the page has a sub-page with a verification_hash, we can detect that in the wiki-text from the external verifier. In v1.1 of the protocol we will improve the performance of this detection through protocol changes.


it-spiderman commented 2 years ago

"Alerts about new versions and a way to 'update the resource to the current one' is very desirable."

This is the thing I'm working on. Once that is done, we should have a demo call. I do have some questions, but I guess we can discuss them then.

it-spiderman commented 2 years ago

Added a mechanism to update hashes in case there is a newer version of the included resources.

FantasticoFox commented 2 years ago

Well done. Looking forward to talking later.


it-spiderman commented 2 years ago

Status update - current state:

  • Implementation for hashing and controlling displayed versions of the included resources is pretty much completed (not only on PoC level, but on final level)
  • Files are also hashed, and included into the page content (of the file page), and included into content_hash
  • All data of a revision (all slots) are exposed over the API

Additional implementations not directly related to requirements:

  • Created services and entity wrappers for database objects (all interaction with the revision_verification table is done over a single service, and database rows are wrapped in an object) => this is done and ready, but not used in all places of the code; some parts still use the old way of accessing the data
  • Revised the get_revision API endpoint and refactored it to use the new services (can be used as an example for others)
  • All functionality from Util and ApiUtil is now in VerificationEngine, making those two files deprecated, but still available since most of the code still uses them.

All changes are in poc_revision_slots (rebased today to dev-for-v1.1).

FantasticoFox commented 2 years ago

Well done! Looking forward to doing the review with RHT on Friday!

it-spiderman commented 2 years ago

There is one issue that I couldn't come to a conclusion on (from a logical perspective). Transclusion hashes hold (among others):

  • genesis_hash => as "title" identifier,
  • content_hash => used to check if the content itself changed in the internal content control mechanism,
  • verification_hash => will be used to confirm the validity of the page in the verifiers.

The verification_hash is the problematic part. We can only include the hash as it was at the time the resource was included. However, that hash can change (through normal use), for example when the resource page is signed. I'm not sure what to do then. The verification hash will no longer match and the page will fail validation. Signing does create a new revision, which will prompt the user on the page where the resource is included to update to the latest version, in which case all is good again. But that seems a bit weird, as the user will see that there is a new version of the resource, but nothing will change after the update (as the page was only signed). The only solution that would work as I want is to update the verification hashes on every page that includes the resource once it's signed. But, of course, since a resource can theoretically be included in an unlimited number of pages, this is not a good idea.

FantasticoFox commented 2 years ago

I think the solution is simple:

1) LINK to a specific revision of the page. This means that if the page changes we don't care, as we are tied to a specific revision, like linking to a specific rev_id.

2) The user is informed when visiting the page that a new version of the resource is available and asked whether he would like to transclude it. "But that seems a bit weird, as the user will see that there is a new version of the resource, but nothing will change after the update (as the page was only signed)" is the correct behaviour. If they visit the page and check the revision, they see that there is a newer version which was signed. I think this can be solved in the future by clarifying the UI, but it is nothing I'm worried about right now.

Let's speak about this today if this is not sufficiently clear.
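
The "link to a specific revision" answer can be sketched like this. The three field names come from the thread; the TransclusionEntry class, the two helper functions, and the representation of a resource's history as a list of verification hashes (oldest first) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransclusionEntry:
    genesis_hash: str       # identifies the resource ("title" identifier)
    content_hash: str       # content of the pinned revision
    verification_hash: str  # verification hash of the pinned revision


def pinned_revision_is_valid(entry: TransclusionEntry, history: list[str]) -> bool:
    """The entry stays valid as long as the pinned revision exists in the resource's history."""
    return entry.verification_hash in history


def newer_version_available(entry: TransclusionEntry, history: list[str]) -> bool:
    """True when the latest revision is not the pinned one (e.g. after a signature)."""
    return bool(history) and history[-1] != entry.verification_hash


entry = TransclusionEntry(genesis_hash="g", content_hash="c", verification_hash="v1")
history_after_signing = ["v0", "v1", "v2"]  # signing added revision v2

print(pinned_revision_is_valid(entry, history_after_signing))
print(newer_version_available(entry, history_after_signing))
```

Under this reading, signing the resource adds a revision but does not invalidate pages that pinned an earlier one; it only triggers the "newer version available" notice.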

FantasticoFox commented 2 years ago

Basic functionality works; New issues will be opened for minor issues.