pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0
3.54k stars 952 forks source link

Formally deprecate journal events? #11918

Open woodruffw opened 2 years ago

woodruffw commented 2 years ago

Just wanted to record this proposal somewhere: the "Project journal" view has had a "deprecation coming soon" warning for a few years now:

Screen Shot 2022-07-21 at 2 00 03 PM

Users are told to visit the security history view instead, which is both generally more detailed (per event) and has more event types. That view has existed since 2019 (https://github.com/pypi/warehouse/pull/6339).

Given that the replacement has been around for a couple of years and that we've been warning about deprecation for as long, does it make sense to formally deprecate journal events and set a timeline for their removal?

divbzero commented 2 years ago

Deprecating journal entries seems to make sense to me. The deprecation warning and new events framework has been in place for about 3 years ago (from your PR #6339) and we also have generic events to add events to any model we would want in the future (from @di’s PR #8324).

What action would be taken to “formally” deprecating the project journal? Would it be announcing a specific date or version when the project journal view would be removed?

To help inform this discussion, do we have insight into how much the project journal page (/manage/project/{project.name}/journal/) is used?

woodruffw commented 2 years ago

What action would be taken to “formally” deprecating the project journal? Would it be announcing a specific date or version when the project journal view would be removed?

One of the PyPI maintainers/admins can correct me, but I think it would involve at minimum:

And then, in terms of the removal itself, it should mostly just be a matter of removing all uses of JournalEvent, deleting the routes and views, and maybe redirecting the current "project journal" route to the "security history" view instead.

dstufft commented 2 years ago

Is JournalEvent different than JournalEntry?

dstufft commented 2 years ago

Ok I see the context now, JournalEntry is what you're actually talking about.

The big thing that relies on the journal is the serial number is actually just the journal entry's ID number, so we would need to add something new to give the behavior that Bandersnatch et al rely on.

woodruffw commented 2 years ago

Is JournalEvent different than JournalEntry?

Huh, my response via email never appeared here...

Yeah, this was just me mixing up words 😅 -- JournalEntry is what I meant.

woodruffw commented 2 years ago

The big thing that relies on the journal is the serial number is actually just the journal entry's ID number, so we would need to add something new to give the behavior that Bandersnatch et al rely on.

I might be missing it, but where does the project's serial number (I'm assuming that's last_serial) depend on the journal entry's ID? I see it as a column on Project:

    last_serial = Column(Integer, nullable=False, server_default=sql.text("0"))

...and that we render it include it in the datamodel used to render the simple index:

def _simple_index(request, serial):
    # Fetch the name and normalized name for all of our projects
    projects = (
        request.db.query(Project.name, Project.normalized_name)
        .order_by(Project.normalized_name)
        .all()
    )

    return {
        "meta": {"api-version": API_VERSION, "_last-serial": serial},
        "projects": [{"name": p.name} for p in projects],
    }
dstufft commented 2 years ago

Most of our denormalization in Warehouse is managed by database level triggers, this one was added here.

CREATE OR REPLACE FUNCTION maintain_project_last_serial()
RETURNS TRIGGER AS $$
DECLARE
    targeted_name text;
BEGIN
    IF TG_OP = 'INSERT' THEN
        targeted_name := NEW.name;
    ELSEIF TG_OP = 'UPDATE' THEN
        targeted_name := NEW.name;
    ELSIF TG_OP = 'DELETE' THEN
        targeted_name := OLD.name;
    END IF;
    UPDATE packages
    SET last_serial = j.last_serial
    FROM (
        SELECT max(id) as last_serial
        FROM journals
        WHERE journals.name = targeted_name
    ) as j
    WHERE packages.name = targeted_name;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_project_last_serial
AFTER INSERT OR UPDATE OR DELETE ON journals
FOR EACH ROW EXECUTE PROCEDURE maintain_project_last_serial();
woodruffw commented 2 years ago

Makes sense -- if I'm understanding the semantics correctly, this only needs to be a monotonically increasing integer, right? In other words, we could theoretically set the last_serial to the current Unix timestamp (64 bits) whenever the relevant replacement security event is triggered?

Edit: Or maybe not on security events, since those aren't formally tied to projects and have high volume, but there's probably something we can trigger on.

dstufft commented 2 years ago

So funny enough, the original mirroring protocol literally used timestamps, as in "give me changes since X".

The serial method was added because there were changes being missed because if you had multiple things being uploaded in the same second as when a mirror says "give me all the changes since X" such that it got some of them but not all of them, the later ones would be skipped.

There's also clock skew problems with different uploaders maybe drifting from each other [^1].

We could work around it by munging things so that when a mirror says, "give me all the changes since X", we just subtract some number from that to give a bit of a window for variations like that. That would probably work? We'd be giving up some level of "we know the mirror will get exactly every change" for "we're pretty sure that unless something goes really weird, they'd get every change".

It looks like Bandersnatch records it's high-water mark for what serial it last saw by just doing max([serials_I_saw_in_this_recent_run]), so other than clock drift causing serials to go backwards, I think that would work fine.

[^1]: Clocks can drift backwards too, but we could fix that by using the monotonic clock APIs.

brainwane commented 2 years ago

Given that the replacement has been around for a couple of years and that we've been warning about deprecation for as long, does it make sense to formally deprecate journal events and set a timeline for their removal?

Yes. It's been long enough.

woodruffw commented 2 years ago

Okay, just to get the ball rolling: any objection to me tweaking the view's current language to emphasize that journal events are deprecated, but without a removal timeline yet?

Another thought I just had: a minimally disruptive change would be to remove the view itself and most of the content from JournalEvents, but the table (and its IDs) around for the time being. That way we could do the deprecation from a browser-facing side without breaking bandersnatch or any other mirroring software.

dstufft commented 2 years ago

Yea we can remove journals from the UI completely, that's not a problem

woodruffw commented 2 years ago

Cool. This isn't scoped in any of my current work, but I'm happy to spend some of my own time doing that this weekend/sometime in the evening this coming week 🙂

dstufft commented 2 years ago

It just occurred to me that we don't need to use something like a timestamp or anything, we can just keep using the database for this.

We get a monotonically increasing, globally unique and ordered serial number because every change inserts a record to the journals table and we use journals.id, but what makes journals.id suitable for this is that it's using the postgres SERIAL type which implements a counter that can only increase, and which you can get the next value from, which happens automatically when that id column is left blank.

But there is no requirement that SERIAL has to be used for primary keys or that it can only be incremented by omitting a value and letting the DB fill in the default value, we could do it by turning Project.last_serial into a database SERIAL column, and whenever something happens that triggers a mirror to need to resync, then we just need to run a query like:

UPDATE projects
SET last_serial = nextval(pg_get_serial_sequence("projects", "last_serial"))
WHERE id = %s

When writing that, it occurs to me that even the SERIAL solution can have the problem of missed updates, because while the sequence guarantees uniqueness and that the serial will be generated in order, it doesn't guarantee that they will be committed in order, so you can still get into situations where two transactions happen A and B which get assigned serial numbers 1 and 2. Then B commits with serial 2, a mirror queries and gets the latest serial is 2, then A commits with serial 1, but gets skipped because 1 < 2.

We could introduce the same kind of window as talked about in the previous comment, go back N changes, but that's kind of hard because we don't actually know how many of those changes were created concurrently, whereas using a timestamp means that we can infer meaning from the serial, and make smart choices about a look back window.

We can even work around the clock skew between multiple nodes problem by letting the database assign our timestamp serial, since we only have 1 master database and we're unlikely to ever have more than that [^1], and just use something like:

UPDATE porjects
SET last_serial = extract(epoch from now())
WHERE id = %s

So we'd only have to worry about the PG server drifting backwards, which we're on RDS and I assume they're running ntp to keep that from happening, so that might be the best option, and then just introduce some 30s or 60s look back window.

[^1]: Multi writer PostgreSQL deployments are hard and the write traffic to PyPI is relatively tiny. We're currently not even a multiple server setup for reads, which is far far more likely to happen.

woodruffw commented 2 years ago

11962 does the first 50% of this -- it gets rid of the journal route and view, but not the DB models or table.

pfmoore commented 2 years ago

I've just been pointed at this discussion.

To add my view, I use the journal data via the XML-RPC changelog_since_serial API, and I use the last_serial data heavily for mirroring. There's nothing specific in the XML-RPC API documentation that describes the changelog data as deprecated, so I don't think it's entirely correct to think that "Given that the replacement has been around for a couple of years and that we've been warning about deprecation for as long".

I assume the serial number APIs will be retained, as they are fundamental to any form of mirroring that tries to be incremental in any way. That specifically means the following information:

  1. The X-PyPI-Last-Serial header on API calls (simple and JSON).
  2. The _last-serial data in the JSON version of the simple API, and the <!--SERIAL xxxx--> comment in the HTML version.
  3. The last-serial element in the JSON API.
  4. The list_packages_with_serial XML-RPC API.
  5. Maybe the changelog_last_serial XML-RPC API. I don't use this one myself as I can get the latest serial as the max of the values from list_packages_with_serial, but I can imagine mirrors that use this one to avoid getting all the package-level data.

If these APIs are going to be removed, I think we need a properly documented and supported replacement, or we have to accept that we've stopped supporting anyone doing incremental mirroring of PyPI, and people wanting to minimise traffic when getting data for offline analysis.

As for the more general changelog (or journal) data, I have an offline copy of this that I use for exploratory data analysis - it's extremely useful for answering questions such as "how many projects are created each month", "what is the average time between releases for projects", etc. And it's the only place I know of to get data on when projects get deleted. The format of the data isn't ideal for analysis, and I'd be fine with getting an improved data format, but losing access to this raw data in some downloadable format would be a problem for me. Also, I imagine that "bridging" the transition from one format to another would be tricky, so I'd like to see some thought given to easing that transition if we are moving to a new API.

I do understand that I'm only one user, and it would be unreasonable for me to expect that the API gets designed just for my convenience, but conversely, I don't think that enough has been done to identify and capture the needs of users of the existing APIs before deprecation. Yes, there's been a lot of general talk about Warehouse wanting to deprecate legacy APIs, but the message has always been that the sticking point is working out how to provide the functionality in better ways, and not that at some point it would just be removed altogether. I also don't think that there's enough thought given to 3rd parties trying to use the data for offline analysis. I know I'm not the only person collecting stats on PyPI usage, and I know I've had a number of people interested in how I get the raw data for my information. But I don't have a feel for whether my approach is typical.

I apologise for not speaking up previously - as I say, I was unaware that this was even being discussed. Hopefully it's not too late for my views to be taken into consideration.

pfmoore commented 2 years ago

https://github.com/pypi/warehouse/pull/11962 does the first 50% of this -- it gets rid of the journal route and view, but not the DB models or table.

Looking at this PR, I should say that I think I'm fine with it - it's only removing access to the data from the web views, which is something I don't use[^1]. My concern is with the other side of things - removing the data from the DB and dropping API access - as well as with the fact that there is currently no access to (what I understand is) the replacement data in the form of audit information.

Also, I see that there's some discussion earlier in the thread about mirroring. So maybe mirroring will be covered - but before the existing approaches are impacted (I hesitate to say "removed", because as far as I can tell there's no formally documented "this is how you should mirror" mechanism), I'd hope there will be some sort of documentation explaining how clients should interact with Warehouse to correctly mirror, in an incremental fashion only downloading what is known to have changed (both for "full" mirrors like bandersnatch, and for partial mirrors like mine, which is focused on metadata).

[^1]: Which may explain why I never saw any of the the announcements of the deprecation, if they were all on these pages...

dstufft commented 2 years ago

I assume the serial number APIs will be retained

Some form of serial number will be retained yes, the method of generation may end up being different (see up thread), but for downstream consumers treating this as an opaque, monotonically increasing integer that describes a general "version" for the data in the database, that should remain the same.

I hesitate to say "removed", because as far as I can tell there's no formally documented "this is how you should mirror" mechanism

There is PEP 381, but it's bit rotted to the point of useless at this point.

I apologise for not speaking up previously - as I say, I was unaware that this was even being discussed. Hopefully it's not too late for my views to be taken into consideration.

No problem! One of the reasons I pointed you here was you mentioned using the changelog APIs, and it occurred to me that this discussion kind of forgot to talk about them. It's certainly not too late (well other than the web UI, but technically that could be reverted if we needed to), as I view this issue as ultimately asking the questions:

  1. Do we want to try and remove the journals table from Warehouse?
    • Sub question, if we don't want to try and remove it wholesale, do we want to remove places where it's surfaced, and if so which ones?
  2. If so, is there any functionality in Warehouse that still depends on the journals table, and do we want to remove it or do we need to come up with a replacement.

So input from people using those APIs is really great.

pfmoore commented 2 years ago

I took a quick look at the "security history" view for one of my projects. It seems rather limited compared to the changelog data. Maybe that's because the web view is a limited subset? There's no "file added" lines, for example. Also, for it to be possible to correlate it with the index data, it would need to use the same serial number. For example, one of the things I do is use the changelog as a way to translate serial numbers to the corresponding timestamps. Another is, when I have a project that's "disappeared" from the simple index or the JSON API, I can look at its last serial number, and check in the changelog to confirm if the last action was "remove project". If it wasn't, I have a bug where I'm not picking up live projects for some reason.

As you can see, the usage is rather adhoc, but that's the nature of the sort of analysis I do in general 🙂

woodruffw commented 1 year ago

Nope, I think that’s just me getting words confused 😅

Sent from mobile. Please excuse my brevity.

On Jul 21, 2022, at 6:16 PM, Donald Stufft @.***> wrote:

 Is JournalEvent different than JournalEntry?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.

divbzero commented 1 year ago

Updated the display of certain project events in commit 40bf004 of pull request #12360.

woodruffw commented 1 year ago

Triaging: @tnytown will be working on this, as part of OSS funding provided by the STF: https://sovereigntechfund.de/en/projects/verbesserung-der-sicherheit-des-python-okosystems/

(@tnytown: if you comment here, I'll be able to assign the issue to you.)

tnytown commented 1 year ago

Hello! Looking forward to starting this work :)

tnytown commented 1 year ago

As an intermediate step to removing the journals table entirely, we can remove all fields from JournalEntry with the exception of id, which backs serial numbers. If anybody is using those fields, please chime in! I'll go through the codebase and review their usages as well.

@dstufft What do you think of this approach?

pfmoore commented 1 year ago

If anybody is using those fields, please chime in!

I do record all of the output from the changelog_since_serial XMLRPC API (name, version, timestamp, action, serial) for analysis. Are those the fields you mean? My use of them is intermittent and not exactly critical, but as well as breaking my code, removing them would presumably break changelog_since_serial, and the documentation describes that API among others as supported.

I'm not entirely clear what the scope of this change is in practice, but please can we consider some sort of proper desupport or deprecation process if the APIs are going to be affected - even if they aren't ideal, and we wish we could just get rid of them, breaking user code without proper warning isn't good.

woodruffw commented 1 year ago

I'm not entirely clear what the scope of this change is in practice, but please can we consider some sort of proper desupport or deprecation process if the APIs are going to be affected - even if they aren't ideal, and we wish we could just get rid of them, breaking user code without proper warning isn't good.

Sorry for the lack of clarity here: in practice, the idea is to replace (or at least trim down) the current JournalEntry model as part of removing surfaces that have been deprecated.

I'm now realizing that the XML-API was never actually officially deprecated, so that needs to happen before we make any additional changes here (especially anything that, as you mentioned, would break end user code or mirror functionality).

pfmoore commented 1 year ago

Agreed. I also think we need to not just deprecate and remove, but offer a suitable replacement. At a minimum, we need to understand what the actual usage of the current API is, so that we can decide whether it's acceptable to remove the functionality without replacement.

My personal use of the changelog API is very adhoc (I collect the data for occasional queries on trends and analysis). My usage generally includes:

dstufft commented 1 year ago

In https://github.com/pypi/warehouse/pull/13936, several types of events will no longer be emitted into the journals table.

Specifically: