pulsar-edit / package-backend

Pulsar Server Backend for Packages
https://api.pulsar-edit.dev
MIT License
12 stars 11 forks source link

PostgreSQL authstate cleanup cronjob #46

Closed Digitalone1 closed 1 year ago

Digitalone1 commented 1 year ago

About #39, I open this issue to know about the cronjob of authstate table cleanup.

@confused-Techie I know I could have asked on Discord, but here it's easy to keep track of changes. Which solution did you adopt to run the cleanup?

https://github.com/pulsar-edit/package-backend/blob/main/scripts/database/create_authstate_table.sql#L9-L10

Besides, did you have issues updating the unique_pack_version constraint?

ALTER TABLE versions DROP CONSTRAINT unique_pack_version;
ALTER TABLE versions ADD CONSTRAINT unique_pack_version UNIQUE(package, semver_v1, semver_v2, semver_v3);
confused-Techie commented 1 year ago

So I haven't pushed those changes to prod just yet, but will ideally do so today.

And once I do will let you know how editing our constraint goes.

As for the solution, I believe what might be best is to create an additional scheduled microservice that will run to clean this up. As it should be a rather quick and simple job for it to do

confused-Techie commented 1 year ago

So @Digitalone1 I've just tried running this script, and unfortunately there is an error.

Essentially there is a package that has the following versions published:

They have many versions like this. And it seems I may have misunderstood you with the original PR, since while we do want to keep the generated columns to be unique, I'd like to see if it's possible to support these tags within the actual semver columns. Is it possible that maybe we add a semver_ext generated column, that could contain whatever comes after the traditional semver and have then the unique constraint placed on that?

Digitalone1 commented 1 year ago

If there is only a single package with this issue, I propose to keep the latest version and delete the other ones.

We already support the extensions and we are permissive on the right side of them. But at least the main part should be unique on what is published on pulsar. An author can have all the extensions they want for the release on their repo, but in my opinion we should require the uniqueness of the main first part x.x.x when a release is made on pulsar.

This is because we need something to sort the semver. It's just hard with stardard semver and luckily we found a way through regex to do it. Supporting also the sort of extensions is a lot harder because they can have anything inside, it's tricky to sort numbers with alpha, beta, etc...

So I'd say, at least let's keep the main semver unique and require that for publishers, then in the future we will introduce something to support different channels (stable/alpha/beta... which are the main purpose of extensions) organizing the version table with an additional column (native, not generated) still having unique semver. We already talked about that on discord, I will open an issue for discussing how we want to organize it.

Let me know what do you think. It's not that we cannot do another generated column. It's that it will be a lot harder for us to manage packages and more headaches that it's not worth for a single package. We need integrity on the database and publishers have to collaborate in maintaining that integrity.

confused-Techie commented 1 year ago

No you have a completely valid point. For the time being, until we have proper support for version channels you aright right, we should just require a unique semver. Since yeah anything can be in there and we won't have any way to sort it.

Unlike with supported version channels we could have predetermined sorting methods.

Thanks a ton, and yeah for the single package I'll delete the effected versions, which luckily non of them are the latest so we don't have to worry to much.

Thanks for the feedback, I'll let you know when I get the chance to try this after work

Digitalone1 commented 1 year ago

@confused-Techie Maybe the query stopped on that package only, but there are others.

To see which ones have duplicated versions, run the following:

SELECT package, semver_v1, semver_v2, semver_v3, COUNT(*) AS vcount
FROM versions
GROUP BY package, semver_v1, semver_v2, semver_v3
ORDER BY vcount DESC, package, semver_v1, semver_v2, semver_v3;
confused-Techie commented 1 year ago

@Digitalone1 Are you sure this script is showing only duplicated versions?

I tried adding the semver column to the output, and many of these don't seem to show any special extension on the semver, so would seem unlikely for duplication. But otherwise this script is returning 69901 results, which if this is the case for duplicated items, I don't believe we would be able to make this change

Digitalone1 commented 1 year ago

@confused-Techie Only the rows where vcount is bigger than 1. That's why I sorted it descending.

Let's use HAVING:

SELECT package, semver_v1, semver_v2, semver_v3, COUNT(*) AS vcount
FROM versions
GROUP BY package, semver_v1, semver_v2, semver_v3
HAVING vcount > 1
ORDER BY vcount DESC, package, semver_v1, semver_v2, semver_v3;
confused-Techie commented 1 year ago

Okay my bad, thanks for posting this second one. Now knowing that vcount count is my concern it looks like there's about 77 packages with a vcount higher than 1. The vast majority have a count of 2, with the highest having a vcount of 25.

But yeah not super ideal, if we want to manually go through all of these

Digitalone1 commented 1 year ago

They're versions, the package should be lower. Please post a screenshot of the following.

SELECT p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
FROM packages p INNER JOIN versions v ON p.pointer = v.package
GROUP BY v.package, v.semver_v1, v.semver_v2, v.semver_v3
HAVING vcount > 1
ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
confused-Techie commented 1 year ago

@Digitalone1 Alright, technically the vcount one is having a syntax error.

But here's the output of your original script

image

And yes sorry I mispoke, not packages, but versions.

There are 77 versions with multiple vcounts, with the majority being a vcount of 2, and the highest a vcount of 25

Digitalone1 commented 1 year ago

@confused_techie Try this

SELECT p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
FROM packages p INNER JOIN versions v ON p.pointer = v.package
GROUP BY v.package, v.semver_v1, v.semver_v2, v.semver_v3
HAVING COUNT(*) > 1
ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
confused-Techie commented 1 year ago

Looks like the output is the same.

image

Just had to change the script to

SELECT p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
FROM packages p INNER JOIN versions v ON p.pointer = v.package
GROUP BY p.name, v.package, v.semver_v1, v.semver_v2, v.semver_v3
HAVING COUNT(*) > 1
ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
Digitalone1 commented 1 year ago

This is pretty bad :(

I will think about a solution, but it's not easy.

confused-Techie commented 1 year ago

This is pretty bad :(

I will think about a solution, but it's not easy.

I've messaged you further about finding a solution to this on Discord, so lets see what we can come up with.

Digitalone1 commented 1 year ago

Please @confused-Techie, run the following script on NodeJS and post here the output. Thanks.

sqlStorage = setupSQL();
console.log(await checkDuplicates());

async function checkDuplicates() {
  try {
    const command = await sqlStorage`
      SELECT p.package, p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
      FROM packages p INNER JOIN versions v ON p.pointer = v.package
      GROUP BY v.package, v.semver_v1, v.semver_v2, v.semver_v3
      HAVING COUNT(*) > 1
      ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
    `;

    if (command.count === 0) {
      return "No packages with duplicated versions.\n";
    }

    let str = "";
    let packs = [];

    for (const v of command) {
      const latest = await latestVersion(v.package);

      if (latest === "") {
        str += `Cannot retrieve latest version for ${v.name} package.\n`;
        continue;
      }

      const semver = `${v.semver_v1}.${v.semver_v2}.${v.semver_v3}`;
      const isLatest = semver === latest;

      str += `Version ${semver} of package ${v.name} is ` +
        isLatest ? "" : "NOT " + "the latest.\n";

      if (!packs.includes(v.package)) {
    packs.push(v.package);
      }
    }

    str += `\n${packs.length} packages have duplicated versions.\n`;

    return str;
  } catch (e) {
    return e;
  }
}

async function latestVersion(p) {
  try {
    const command = await sqlStorage`
      SELECT semver_v1, semver_v2, semver_v3
      FROM versions
      WHERE package = ${p} AND status != 'removed'
      ORDER BY semver_v1 DESC, semver_v2 DESC, semver_v3 DESC;
    `;

    return command.count !== 0
      ? `${command[0].semver_v1}.${command[0].semver_v2}.${command[0].semver_v3}`
      : "";
  } catch (e) {
    return "";
  }
}
Digitalone1 commented 1 year ago

So, giving a manual check on what is seen on the screenshot (I suppose there are more not shown), the followings are the latest versions:

confused-Techie commented 1 year ago

Alright @Digitalone1 I've written and ran the script.

It needed a few small modifications, but I'll upload the results, and you can review the final script in #53

Below are the results:

Output of the Script ```bash Version 1.16.0 of package mcfunction-lang is NOT the latest. Version 0.2.1 of package php-hyperclick is NOT the latest. Version 0.0.10 of package atom-weui is NOT the latest. Version 1.9.1 of package autocomplete-python-jedi is NOT the latest. Version 3.0.0 of package learn-ide-3 is the latest. Version 0.1.3 of package gitlab is the latest. Version 0.45.0 of package language-go is NOT the latest. Version 1.2.6 of package metrics is NOT the latest. Version 0.34.1 of package github is NOT the latest. Version 0.34.2 of package github is NOT the latest. Version 1.0.0 of package ide-reason is NOT the latest. Version 0.26.0 of package language-shellscript is NOT the latest. Version 0.3.0 of package language-typescript is NOT the latest. Version 0.113.1 of package symbols-view-plus is NOT the latest. Version 2.1.0 of package duotone-stark-sea-syntax is the latest. Version 0.0.0 of package quick-jump is NOT the latest. Version 0.113.2 of package symbols-view-plus is NOT the latest. Version 0.116.1 of package symbols-view-plus is NOT the latest. Version 0.4.0 of package athena-ide-atom is NOT the latest. Version 0.1.0 of package atom-selection-length is the latest. Version 2.1.0 of package duotone-bright-sea-syntax is the latest. Version 0.31.1 of package github is NOT the latest. Version 0.1.2 of package gitlab is NOT the latest. Version 2.2.0 of package ide-mocha is NOT the latest. Version 0.19.0 of package ide-purescript is NOT the latest. Version 0.22.0 of package ide-rust is NOT the latest. Version 0.129.7 of package language-javascript is NOT the latest. Version 0.118.2 of package symbols-view-plus is the latest. Version 0.13.2 of package teletype is NOT the latest. Version 3.0.0 of package vk-terminal is NOT the latest. Version 0.37.1 of package whitespace is NOT the latest. Version 1.2.0 of package a-d-e8-pad is NOT the latest. Version 1.0.0 of package alternatone-stark-sea-syntax is NOT the latest. Version 0.64.4 of package archive-view is NOT the latest. Version 0.64.5 of package archive-view is NOT the latest. Version 0.2.1 of package atom-bmob is NOT the latest. Version 0.0.3 of package atom-citron-formatter is the latest. Version 0.3.1 of package atom-over-and-back is NOT the latest. Version 0.91.1 of package bracket-matcher is NOT the latest. Version 0.0.1 of package build-ember is the latest. Version 0.3.4 of package chameleon-qdt-atom is NOT the latest. Version 1.0.1 of package elastic-tabstops is NOT the latest. Version 8.4.9 of package file-icons-types is NOT the latest. Version 1.1.0 of package ghoulish-ui is NOT the latest. Version 1.1.0 of package git-diff is NOT the latest. Version 0.48.2 of package grammar-selector is NOT the latest. Version 0.60.3 of package language-c is NOT the latest. Version 0.4.5 of package language-citron is NOT the latest. Version 2.0.3 of package language-cql is NOT the latest. Version 0.46.2 of package language-go is NOT the latest. Version 3.0.0 of package language-hql is NOT the latest. Version 0.50.2 of package language-html is NOT the latest. Version 0.51.2 of package language-html is NOT the latest. Version 0.52.1 of package language-html is NOT the latest. Version 0.129.19 of package language-javascript is NOT the latest. Version 0.51.4 of package language-python is NOT the latest. Version 0.72.4 of package language-ruby is NOT the latest. Version 0.27.4 of package language-shellscript is NOT the latest. Version 0.4.5 of package language-typescript is NOT the latest. Version 0.0.2 of package magento-tools is NOT the latest. Version 2.4.19 of package mastermind is NOT the latest. Version 1.18.0 of package mcfunction-lang is NOT the latest. Version 1.18.2 of package mcfunction-lang is the latest. Version 0.70.5 of package notifications is NOT the latest. Version 1.2.0 of package open-on-github is NOT the latest. Version 0.9.0 of package p5xjs-autocomplete is NOT the latest. Version 1.2.3 of package prettier-eslint is NOT the latest. Version 2.0.7 of package require-anywhere is the latest. Version 0.2.1 of package save-autorun is NOT the latest. Version 1.0.1 of package sf-tools is NOT the latest. Version 1.0.5 of package sf-tools is NOT the latest. Version 1.4.0 of package status-bar is NOT the latest. Version 0.117.0 of package symbols-view-plus is NOT the latest. Version 0.0.6 of package syntax-scope-inspector is NOT the latest. Version 1.6.0 of package typo3-fluid-atom-snippets is the latest. Version 0.2.13 of package winnie-atom-plugin is NOT the latest. Version 0.3.0 of package zero-title-bar is the latest. 60 packages have duplicated versions. ```
Digitalone1 commented 1 year ago

Okay, 60 package with duplicated versions. 11 latest versions with duplicates to fix.

So, what do you think to do now? I propose to delete (erase, not tagging with removed) the versions that are NOT the latest.

Then fix the remaining setting the proper latest and removing the other versions that have the same semver but are not the latest.

Saw the script. It's good. I'll modify it to remove the versions that are not the latest just to start. I'll provide the modified script when possible.

I'll check also the remaining packages to complete the list of duplicated latest versions.

@confused-Techie Tell me if you agree or have other plans. Thanks.

confused-Techie commented 1 year ago

While I want to minimize the amount of deletions that occur on the server, if we don't see a smooth way to support both, then your plan sounds good to me.

We can remove the versions that are duplicated, and not the latest.

But as for packages that do have the latest tag we will likely want to contact the repository owner, prior to any deletion or modification of their version.

I do again want to say I'd rather find a way to support both, since especially this requirement, when not met just means we can't properly sort the packages versions, and we could leave that up to the editor. But otherwise then lets go ahead with this plan if there's no way to do so.

As for writing the script, if at all possible do you mind adding it as another script in that same folder? I have a feeling having scripts on hand to sort through packages can prove useful, with just a change of a few values.

Digitalone1 commented 1 year ago
Package Version License Archived Latest Release Latest Release Is Marked Latest on Backend
learn-ide-3 3.0.0 MIT Yes v3.0.0 Yes
atom-gitlab 0.1.3 MIT Yes v0.1.3-7 Yes
duotone-stark-sea-syntax 2.1.0 MIT No v2.1.0-6 Yes
atom-selection-length 0.1.0 MIT Yes v0.1.0-4 Yes
duotone-bright-sea-syntax 2.1.0 MIT No v2.1.0-6 Yes
atom-symbols-view-plus 0.118.2 PD Yes v0.118.2-plus.3 Yes
atom-citron-formatter 0.0.3 None No v0.0.3-1 No (0.0.3)
build-ember 0.0.1 MIT No v0.0.1 Yes
mcfunction-lang 1.18.2 MIT No v1.18.2-build-2.2.0 Yes
require-anywhere 2.0.7 MIT No v2.0.7 Yes
typo3-fluid-atom-snippets 1.6.0 None Yes v1.6.0 Yes
zero-title-bar 0.3.0 MIT No v0.3.0 Yes

Before making the deletion script for non-latest versions, this is a recap of the packages that have latest duplicated versions.

I agree to contact the publishers, but at the same time I wonder if it's really needed to inform them. We could just set the latest released on their repo as the latest on our database. I mean, this is what they wanted in the first place. Marking the wrong latest was our issue while importing the atom database to the new backend (maybe it's not an issue for some of them which have their latest released as the latest on our side).

Anyway, if we have to contact them, @confused-Techie what do you think about doing it only for packages that are not archived?

About supporting alpha/beta versions, I will open a specific issue about the introduction of different channels. But I'd like to resolve this first.

Also, what did we agree about packages without licence? Should we contact the authors to recommend to move a free license? We discussed on discord, but I don't remember at the moment...

Update: added two columns after checking the latest releases.

Digitalone1 commented 1 year ago

So I checked through swagger on the /api/:packType/:packageName and all the packages listed in the table have the latest release marked as latest on our beckend, except one: atom-citron-formatter

As you can see in the tags, v0.0.3 (latest in the database) is at the top, but v0.0.3-1 (second on the list) has been released one day after.

(And this is an example of why we need to sort the versions not relying on extensions)

So, substantially we only need to fix this package (which is also without license, I don't know if it 100% fit our policy).

confused-Techie commented 1 year ago

Sorry for my lag. But to answer a few things:

Digitalone1 commented 1 year ago

Don't worry, take your time.

  • As for what to do here, I'll go ahead and run your script as recommended, But so beyond deleting packages old versions, does this mean the only package we will actually have to modify is atom-citron-formatter since that'd be awesome if that's the only one that needs real interaction with the maintainer.

atom-citron-formatter is the only package which its latest release does not reflect the latest status on our backend.

Please do not forget to make a backup before running the script for safety reasons.

Digitalone1 commented 1 year ago

Since there's #64 and no need to change the unique constraint anymore, here we discuss only about the cleanup of authstate codes.

@confused-Techie Did you make the cron job? Is it running regularly? At which interval?

Can you post the result of the following? Thanks.

SELECT id, created
FROM authstate
WHERE created < CURRENT_TIMESTAMP - INTERVAL '10 minutes';
confused-Techie commented 1 year ago

I have not gotten around to creating the cleanup. But the code utilizing the new AuthState has not been pushed to production yet, since we still have the updates to version handling to work out. Once we get to a point where we can push those changes I'll start running the cleanup for the auth codes

Digitalone1 commented 1 year ago

@confused-Techie Can you show the following to ensure the cleanup is taking place so we can close this issue? Thanks.

SELECT id, created
FROM authstate
WHERE created < CURRENT_TIMESTAMP - INTERVAL '10 minutes';
confused-Techie commented 1 year ago

@Digitalone1 the output of the above command is empty, returning 0 results.

This issue has been resolved via pulsar-edit/package-frontend#60.

In that PR you can look at .config to find the interval at which this task is scheduled