pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Lossless encoding of XML-RPC data #14371

Open wayphinder opened 1 year ago

wayphinder commented 1 year ago

What's the problem this feature will solve?

The _clean_for_xml function removes characters that are illegal in XML: https://github.com/pypi/warehouse/blob/496338e94d6d62811671e7754507d3d8bc3942c0/warehouse/legacy/api/xmlrpc/views.py#L83-L93

This makes it harder to correlate this information with other sources. For example, the action field contains filenames, which might not match the actual filenames because some characters have been removed.

Describe the solution you'd like

Base64-encode the relevant fields, or otherwise encode them in a way that does not remove data.
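To make the difference concrete, here is a minimal sketch contrasting lossy stripping with a Base64 round trip. The regular expression is a simplified stand-in, not the actual pattern used by _clean_for_xml, and the helper names and sample action string are hypothetical.

```python
import base64
import re

# Assumption: simplified stand-in for the real pattern. It only covers the
# C0 control characters that XML 1.0 forbids.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal(text: str) -> str:
    """Lossy: drop characters that cannot appear in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def encode_lossless(text: str) -> str:
    """Lossless: Base64 of the UTF-8 bytes, which is plain ASCII."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_lossless(payload: str) -> str:
    """Inverse of encode_lossless."""
    return base64.b64decode(payload).decode("utf-8")


# Hypothetical action value whose filename contains a backspace character.
action = "add source file demo\x08-1.0.tar.gz"

stripped = strip_illegal(action)                           # backspace is gone
round_tripped = decode_lossless(encode_lossless(action))   # backspace survives
```

After stripping, the filename can no longer be matched byte-for-byte against other sources, whereas the Base64 form decodes back to the original string.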


woodruffw commented 1 year ago

Some previous context: https://github.com/pypi/warehouse/issues/5653

woodruffw commented 1 year ago

Other context: I'm talking about this with @wayphinder in person. It sounds like the main place where this causes problems for him is in the changelog_since_serial endpoint, where e.action gets munged:

https://github.com/pypi/warehouse/blob/496338e94d6d62811671e7754507d3d8bc3942c0/warehouse/legacy/api/xmlrpc/views.py#L474

My first thought here was to add another member to the end of the list that gets returned here, essentially trading a bit of extra response size for probably not breaking compatibility (since the list will only strictly increase in size, and pre-existing fields won't change). But that might also cause issues that I'm not aware of.
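A rough sketch of that idea follows, assuming each changelog row is a list of (name, version, timestamp, action, serial) and that the extra member carries a Base64 copy of the unmodified action. The function names, the sixth field, and the stripping pattern are hypothetical and not part of warehouse.

```python
import base64
import re

# Assumption: simplified stand-in for the characters that get stripped.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def build_row(name, version, timestamp, action, serial):
    """Keep the five existing fields unchanged and append a lossless copy."""
    raw_action = base64.b64encode(action.encode("utf-8")).decode("ascii")
    return [
        name,
        version,
        timestamp,
        _ILLEGAL_XML_CHARS.sub("", action),  # existing, lossy field
        serial,
        raw_action,  # hypothetical sixth member
    ]


def read_by_index(row):
    """A consumer that indexes into the row ignores any trailing members."""
    return row[0], row[4]


def read_by_unpacking(row):
    """A consumer that unpacks exactly five values breaks on a longer row."""
    name, version, timestamp, action, serial = row
    return name, serial


row = build_row("demo", "1.0", 1700000000, "add source file demo\x08-1.0.tar.gz", 7)
```

Index-based consumers keep working with the longer row, but any client that unpacks exactly five values would raise ValueError, which is one way the change could still break compatibility.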

ewdurbin commented 1 year ago

The primary known/supported use case for this endpoint is PEP 381 and its most prominent implementation, bandersnatch.

bandersnatch currently consumes changelog_since_serial in a way that would not choke on the proposed fix (adding another member to the end of the list): https://github.com/pypa/bandersnatch/blob/b3517c5acf696008da0ecd9544a4823a676191d1/src/bandersnatch/master.py#L207-L216

But in general I'm very hesitant to wake the XMLRPC dragon, as we currently support it only for mirroring and do not intend to take on new support for its use.

wayphinder commented 1 year ago

While changing the XML-RPC API would be great, for my use case a one-time dump of the current data in a lossless format would also work. A lot of the same data should be available in the BigQuery dataset, but my understanding is that some historical data is missing, which is why I would like the changelog data.
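One possible shape for such a dump is JSON Lines, since JSON escapes control characters rather than dropping them. This is only a sketch: the row layout (name, version, timestamp, action, serial) and the function names are assumptions, and nothing here reflects how warehouse stores or exports its data.

```python
import io
import json


def dump_rows(rows, out):
    """Write one JSON array per line; control characters are escaped, not removed."""
    for row in rows:
        out.write(json.dumps(row, ensure_ascii=True) + "\n")


def load_rows(src):
    """Read the rows back, skipping blank lines."""
    return [json.loads(line) for line in src if line.strip()]


# Hypothetical rows; the second action contains a backspace character.
rows = [
    ["demo", "1.0", 1700000000, "new release", 6],
    ["demo", "1.0", 1700000001, "add source file demo\x08-1.0.tar.gz", 7],
]

buffer = io.StringIO()
dump_rows(rows, buffer)
buffer.seek(0)
restored = load_rows(buffer)
```

Reading the dump back yields the original strings, including the characters that the XML-RPC path would have removed.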