generated xml is not stable

ietf-tools / bibxml-service

Django-based Web service implementing IETF BibXML APIs

https://bib.ietf.org

BSD 3-Clause "New" or "Revised" License


generated xml is not stable #241

Closed by rjsparks 2 years ago

rjsparks commented 2 years ago

See https://mailarchive.ietf.org/arch/msg/tools-discuss/yRqq-psSVnB-qoNbICFvzZQDmIo.

While the order of attributes and of unordered elements doesn't matter to XML parsers, it matters a lot to text differencing engines.

The output should be stable (repeated fetches should return identical bits, not seemingly randomly reordered ones).
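
To illustrate the problem, here is a minimal sketch using the Python standard library's `xml.etree.ElementTree.canonicalize()` (not necessarily what the service uses). The element names, attribute values, and URL are made up for the example. It shows two parser-equivalent documents that differ as text, and also that canonicalization sorts attributes but leaves element order alone, so a stable element order would still have to come from the code that generates the XML.

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical fetches of the same reference: same data, different attribute order.
first_fetch = (
    '<reference anchor="RFC2119" target="https://example.org/rfc2119">'
    "<title>Key words</title></reference>"
)
second_fetch = (
    '<reference target="https://example.org/rfc2119" anchor="RFC2119">'
    "<title>Key words</title></reference>"
)

# A text differ sees a change even though an XML parser would not.
differs_as_text = first_fetch != second_fetch

# Canonicalization sorts attributes, so the canonical forms are identical.
same_after_c14n = canonicalize(first_fetch) == canonicalize(second_fetch)

# Canonicalization does NOT reorder child elements.
authors_ab = '<front><author fullname="A"/><author fullname="B"/></front>'
authors_ba = '<front><author fullname="B"/><author fullname="A"/></front>'
element_order_kept = canonicalize(authors_ab) != canonicalize(authors_ba)
```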

rjsparks commented 2 years ago

(This issue should probably be migrated to bibxml-service.)

ronaldtse commented 2 years ago

Transferred and acknowledged.

https://github.com/ietf-ribose/bibxml-service/issues/234 will address this issue.

strogonoff commented 2 years ago

We can canonicalize returned XML, but I’d like to learn more about the new use case being described.

For example, if XML diffing is being done, isn’t it the responsibility of the differ to ensure that all sources being diffed are canonicalized using the same method, rather than relying on the API to return canonicalized XML and hoping it was canonicalized with the same set of options?
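
A rough sketch of what differ-side canonicalization could look like, assuming Python's standard `xml.etree.ElementTree.canonicalize()` and `difflib`. The function name `canonical_diff` and the option set `C14N_OPTIONS` are hypothetical and not taken from any existing script; the point is only that both inputs go through the same function with the same options before being compared.

```python
import difflib
from xml.etree.ElementTree import canonicalize

# Hypothetical option set; what matters is that both sides share it.
C14N_OPTIONS = {"strip_text": True, "with_comments": False}


def canonical_diff(left_xml: str, right_xml: str) -> list:
    """Return unified-diff lines for two XML strings, canonicalizing both first."""
    left = canonicalize(left_xml, **C14N_OPTIONS)
    right = canonicalize(right_xml, **C14N_OPTIONS)
    # For readability only: put each tag on its own line before diffing.
    left_lines = left.replace("><", ">\n<").splitlines()
    right_lines = right.replace("><", ">\n<").splitlines()
    return list(difflib.unified_diff(left_lines, right_lines, lineterm=""))


# Attribute order differs, content is the same: no diff lines expected.
same = canonical_diff(
    '<reference anchor="X" target="t"><title>T</title></reference>',
    '<reference target="t" anchor="X"><title>T</title></reference>',
)

# The title text differs: the diff is non-empty.
changed = canonical_diff(
    '<reference anchor="X"><title>Old</title></reference>',
    '<reference anchor="X"><title>New</title></reference>',
)
```

With this shape, the differ does not need to know whether either source was already canonicalized, or with which options.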

strogonoff commented 2 years ago

For example, the differ in the xml2rfc path testing script (created for transitional purposes) now does canonicalization as follows: https://github.com/ietf-ribose/xml2rfc-mapping-convertor/commit/6564383e0a6363f59a2e44fc7c22c641f5e70921

ronaldtse commented 2 years ago

Thanks! Can we have a sample to see how the canonicalized content looks?

> While the order of attributes and of unordered elements doesn't matter to XML parsers, it matters a lot to text differencing engines.

Text differencing engines should really not be relied upon for structured data content, but I recognise that in certain conditions, e.g. git, it is unavoidable.

strogonoff commented 2 years ago

Yes: you can use the latest version of the test_paths.py script, which produces a report with canonicalized diffs like bibxml-report.html.zip.

strogonoff commented 2 years ago

> Text differencing engines should really not be relied upon for structured data content, but I recognise that in certain conditions, e.g. git, it is unavoidable.

Well, as you can see in the above report, our diffs with xml2rfc are very large because we pretty-print XML, while the xml2rfc tools output unindented lines. Canonicalizing an XML string with lxml.etree.canonicalize() appears to leave indentation as is, so every indented line shows up as a difference.

It’s just another illustration that a diffing layer shouldn’t trust the service to return a normal form. Instead, it should do extra mangling on its own, going beyond the c14n spec, to ensure both representations are truly as equivalent as possible.
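
The indentation effect can be reproduced with a small sketch. It uses the standard library's `xml.etree.ElementTree.canonicalize()` rather than lxml, so that it has no dependencies; that lxml's function behaves the same way here is an assumption. The sample documents are made up. The `strip_text` option is one possible form of the "extra mangling": it is a parser option rather than part of the canonical form itself, and it also trims leading and trailing whitespace in real text content, so it is only suitable when that whitespace is not significant.

```python
from xml.etree.ElementTree import canonicalize

# The same hypothetical document, once unindented and once pretty-printed.
flat = '<reference anchor="RFC2119"><front><title>Key words</title></front></reference>'
pretty = """<reference anchor="RFC2119">
  <front>
    <title>Key words</title>
  </front>
</reference>"""

# Default canonicalization keeps whitespace-only text, so indentation survives.
indentation_survives = canonicalize(flat) != canonicalize(pretty)

# Stripping text on both sides removes the indentation difference.
equal_after_strip = (
    canonicalize(flat, strip_text=True) == canonicalize(pretty, strip_text=True)
)
```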

ronaldtse commented 2 years ago

> It’s just another illustration that a diffing layer shouldn’t trust service to return a normal form and do extra mangling

I agree. A comparison can only work if the canonicalisation is run on both comparands.

strogonoff commented 2 years ago

> > It’s just another illustration that a diffing layer shouldn’t trust service to return a normal form and do extra mangling

> I agree. A comparison can only work if the canonicalisation is run on both comparands.

Yes, that way the differ can ensure it runs the same canonicalization algorithm with the same parameters on both sides, and can do any extra normalization beyond the XML c14n spec if required.

strogonoff commented 2 years ago

The service now canonicalizes XML, and the order of elements should be consistent (more details: https://github.com/ietf-ribose/bibxml-service/issues/239#issuecomment-1193069048).