jazzband / tablib

Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.
https://tablib.readthedocs.io/
MIT License
4.56k stars 588 forks source link

Addition of ElementTree changes HTML output #558

Closed matthewhegarty closed 9 months ago

matthewhegarty commented 11 months ago

I noticed that this change breaks the CI build in django-import-export. This is for cases when we are escaping harmful characters such as script tags.

If there is already an escaped character in content, then the use of ElementTree will double escape, which will not format correctly. This test (added to HTMLTests) will reproduce the issue:

    def test_html_export_with_special_chars(self):
        self.founders = tablib.Dataset(headers=self.headers, title='Founders')
        self.founders.append(('J & J', 'A', 90))
        self.assertIn("J & J", self.founders.html)

(this test passes in v3.3.0)

This test fails because the '&' string is exported as '&' So if someone has 'J & J' in their content (i.e. rendered in escape & amp;), they will now get "double escaping" 'J & amp;amp; J'.

claudep commented 11 months ago

I understand this is a breaking change (if not reverted, it should be documented). However I think that security-wise, it's a much better default behavior than the previous one. Maybe we should add an option to the format so that the client code can specify that it already cared for escaping the content (and thus avoid the double-escaping)?

matthewhegarty commented 11 months ago

As long as the escaping is done we can handle it ok (so no need for an additional option), however it would be ideal if we could opt into the new ElementTree format when we are ready. Perhaps to do that you could add a deprecation path using a settings flag so that clients can opt into ElementTree? I think without this you could well get complaints that exports have changed when the next release goes out. Then it could be documented and the change explained (it took me a little while to figure out why our CI build was broken).

matthewhegarty commented 10 months ago

Hi Claude - are you ok with me submitting a PR to try to make this new dependency on ElementTree optional? Then we can plan the deprecation in django-import-export. Please let me know and I'll pull it together.

claudep commented 10 months ago

Submitting PRs doesn't require any permission :smile:

matthewhegarty commented 10 months ago

Understood... Just wanted to double-check it was something you would consider before I started.

matthewhegarty commented 10 months ago

I looked into whether we could control escaping of text elements using a parameter or similar, but this is not possible (certain characters always get escaped).

matthewhegarty commented 10 months ago

@claudep would be grateful if you could take a look at #562

claudep commented 10 months ago

Thanks for suggesting the pull request. However, I'm not thrilled at all by generating HTML by hand as a string (both performance and security-wise). Hopefully other contributors will chime in and tell us their feelings.

Did you try on the django-import-export side to conditionally escape based on tablib version? Would it be doable?

matthewhegarty commented 10 months ago

Thanks Claude, I did think there must be a good reason why the HTML has not been generated by hand previously, as it seems like the simplest approach considering we are outputting a table. Perhaps escaping the values by default (i.e. with html.escape()) would allay the security concerns, because then you would always end up with an HTML table containing safe values, and this passes all current tests. If escaping by default, I can't see how this would create any security issues, and would allow users to disable escaping if they needed to.

I cannot see performance being an issue except maybe for very large datasets, and even then I don't think it would be too dissimilar to existing versions.

Did you try on the django-import-export side to conditionally escape based on tablib version? Would it be doable?

It would not be doable because the current version of tablib will always escape (because it uses ElementTree). We'd have to find a way to conditionally load a previous version of tablib (i.e. with MarkupPy) to ensure we didn't break any existing implementations.

It might be that we have to release our new version with the ElementTree dependency, and see if anyone complains.

claudep commented 10 months ago

I see django-import-export supports html export without escaping content. Do you think that there are valid use cases for that capability? Is deprecating this an option for you?

matthewhegarty commented 10 months ago

Do you think that there are valid use cases for that capability?

I think there is a valid use case, which is if users already have escaped content in their db, and they want to export without double escaping. However, I don't know if anyone does this, and I think it is unlikely that they do.

Is deprecating this an option for you?

Yes, I think this is the way to go. I will pin the latest release of tablib in import-export, and then add a note to advise that escaping will be mandatory in a future release. If anyone complains, I can think again, but otherwise I would then be free to upgrade to the latest version of tablib in future.

Thanks for your help with this. Feel free to close the PR if you wish.

matthewhegarty commented 10 months ago

For reference, the import-export PR is here

hugovk commented 10 months ago

I think it might be easier all around to pin to 3.5.0 until you're ready to upgrade, and that the next Tablib should be 4.0.0, and mention the change clearly in the release notes.

Thank you though for testing the main branch in your CI, it's rare but very welcome to get pre-release feedback! 👍