Closed Digitalone1 closed 1 year ago
So I haven't pushed those changes to prod just yet, but will ideally do so today.
And once I do will let you know how editing our constraint goes.
As for the solution, I believe what might be best is to create an additional scheduled microservice that will run to clean this up. As it should be a rather quick and simple job for it to do
So @Digitalone1 I've just tried running this script, and unfortunately there is an error.
Essentially there is a package that has the following versions published:
1.9.1-0
1.9.1-1
1.9.1-2
They have many versions like this. And it seems I may have misunderstood you with the original PR, since while we do want to keep the generated columns to be unique, I'd like to see if it's possible to support these tags within the actual semver columns. Is it possible that maybe we add a semver_ext
generated column, that could contain whatever comes after the traditional semver and have then the unique constraint placed on that?
If there is only a single package with this issue, I propose to keep the latest version and delete the other ones.
We already support the extensions and we are permissive on the right side of them. But at least the main part should be unique on what is published on pulsar. An author can have all the extensions they want for the release on their repo, but in my opinion we should require the uniqueness of the main first part x.x.x when a release is made on pulsar.
This is because we need something to sort the semver. It's just hard with stardard semver and luckily we found a way through regex to do it. Supporting also the sort of extensions is a lot harder because they can have anything inside, it's tricky to sort numbers with alpha
, beta
, etc...
So I'd say, at least let's keep the main semver unique and require that for publishers, then in the future we will introduce something to support different channels (stable/alpha/beta... which are the main purpose of extensions) organizing the version table with an additional column (native, not generated) still having unique semver. We already talked about that on discord, I will open an issue for discussing how we want to organize it.
Let me know what do you think. It's not that we cannot do another generated column. It's that it will be a lot harder for us to manage packages and more headaches that it's not worth for a single package. We need integrity on the database and publishers have to collaborate in maintaining that integrity.
No you have a completely valid point. For the time being, until we have proper support for version channels you aright right, we should just require a unique semver. Since yeah anything can be in there and we won't have any way to sort it.
Unlike with supported version channels we could have predetermined sorting methods.
Thanks a ton, and yeah for the single package I'll delete the effected versions, which luckily non of them are the latest so we don't have to worry to much.
Thanks for the feedback, I'll let you know when I get the chance to try this after work
@confused-Techie Maybe the query stopped on that package only, but there are others.
To see which ones have duplicated versions, run the following:
SELECT package, semver_v1, semver_v2, semver_v3, COUNT(*) AS vcount
FROM versions
GROUP BY package, semver_v1, semver_v2, semver_v3
ORDER BY vcount DESC, package, semver_v1, semver_v2, semver_v3;
@Digitalone1 Are you sure this script is showing only duplicated versions?
I tried adding the semver
column to the output, and many of these don't seem to show any special extension on the semver, so would seem unlikely for duplication. But otherwise this script is returning 69901 results, which if this is the case for duplicated items, I don't believe we would be able to make this change
@confused-Techie Only the rows where vcount
is bigger than 1. That's why I sorted it descending.
Let's use HAVING:
SELECT package, semver_v1, semver_v2, semver_v3, COUNT(*) AS vcount
FROM versions
GROUP BY package, semver_v1, semver_v2, semver_v3
HAVING vcount > 1
ORDER BY vcount DESC, package, semver_v1, semver_v2, semver_v3;
Okay my bad, thanks for posting this second one. Now knowing that vcount
count is my concern it looks like there's about 77 packages with a vcount higher than 1. The vast majority have a count of 2, with the highest having a vcount of 25.
But yeah not super ideal, if we want to manually go through all of these
They're versions, the package should be lower. Please post a screenshot of the following.
SELECT p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
FROM packages p INNER JOIN versions v ON p.pointer = v.package
GROUP BY v.package, v.semver_v1, v.semver_v2, v.semver_v3
HAVING vcount > 1
ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
@Digitalone1 Alright, technically the vcount
one is having a syntax error.
But here's the output of your original script
And yes sorry I mispoke, not packages, but versions.
There are 77 versions with multiple vcounts, with the majority being a vcount of 2, and the highest a vcount of 25
@confused_techie Try this
SELECT p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
FROM packages p INNER JOIN versions v ON p.pointer = v.package
GROUP BY v.package, v.semver_v1, v.semver_v2, v.semver_v3
HAVING COUNT(*) > 1
ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
Looks like the output is the same.
Just had to change the script to
SELECT p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
FROM packages p INNER JOIN versions v ON p.pointer = v.package
GROUP BY p.name, v.package, v.semver_v1, v.semver_v2, v.semver_v3
HAVING COUNT(*) > 1
ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
This is pretty bad :(
I will think about a solution, but it's not easy.
This is pretty bad :(
I will think about a solution, but it's not easy.
I've messaged you further about finding a solution to this on Discord, so lets see what we can come up with.
Please @confused-Techie, run the following script on NodeJS and post here the output. Thanks.
sqlStorage = setupSQL();
console.log(await checkDuplicates());
async function checkDuplicates() {
try {
const command = await sqlStorage`
SELECT p.package, p.name, v.semver_v1, v.semver_v2, v.semver_v3, COUNT(*) AS vcount
FROM packages p INNER JOIN versions v ON p.pointer = v.package
GROUP BY v.package, v.semver_v1, v.semver_v2, v.semver_v3
HAVING COUNT(*) > 1
ORDER BY vcount DESC, p.name, v.semver_v1, v.semver_v2, v.semver_v3;
`;
if (command.count === 0) {
return "No packages with duplicated versions.\n";
}
let str = "";
let packs = [];
for (const v of command) {
const latest = await latestVersion(v.package);
if (latest === "") {
str += `Cannot retrieve latest version for ${v.name} package.\n`;
continue;
}
const semver = `${v.semver_v1}.${v.semver_v2}.${v.semver_v3}`;
const isLatest = semver === latest;
str += `Version ${semver} of package ${v.name} is ` +
isLatest ? "" : "NOT " + "the latest.\n";
if (!packs.includes(v.package)) {
packs.push(v.package);
}
}
str += `\n${packs.length} packages have duplicated versions.\n`;
return str;
} catch (e) {
return e;
}
}
async function latestVersion(p) {
try {
const command = await sqlStorage`
SELECT semver_v1, semver_v2, semver_v3
FROM versions
WHERE package = ${p} AND status != 'removed'
ORDER BY semver_v1 DESC, semver_v2 DESC, semver_v3 DESC;
`;
return command.count !== 0
? `${command[0].semver_v1}.${command[0].semver_v2}.${command[0].semver_v3}`
: "";
} catch (e) {
return "";
}
}
So, giving a manual check on what is seen on the screenshot (I suppose there are more not shown), the followings are the latest versions:
Alright @Digitalone1 I've written and ran the script.
It needed a few small modifications, but I'll upload the results, and you can review the final script in #53
Below are the results:
Okay, 60 package with duplicated versions. 11 latest versions with duplicates to fix.
So, what do you think to do now? I propose to delete (erase, not tagging with removed) the versions that are NOT the latest.
Then fix the remaining setting the proper latest and removing the other versions that have the same semver but are not the latest.
Saw the script. It's good. I'll modify it to remove the versions that are not the latest just to start. I'll provide the modified script when possible.
I'll check also the remaining packages to complete the list of duplicated latest versions.
@confused-Techie Tell me if you agree or have other plans. Thanks.
While I want to minimize the amount of deletions that occur on the server, if we don't see a smooth way to support both, then your plan sounds good to me.
We can remove the versions that are duplicated, and not the latest.
But as for packages that do have the latest tag we will likely want to contact the repository owner, prior to any deletion or modification of their version.
I do again want to say I'd rather find a way to support both, since especially this requirement, when not met just means we can't properly sort the packages versions, and we could leave that up to the editor. But otherwise then lets go ahead with this plan if there's no way to do so.
As for writing the script, if at all possible do you mind adding it as another script in that same folder? I have a feeling having scripts on hand to sort through packages can prove useful, with just a change of a few values.
Package | Version | License | Archived | Latest Release | Latest Release Is Marked Latest on Backend |
---|---|---|---|---|---|
learn-ide-3 | 3.0.0 | MIT | Yes | v3.0.0 | Yes |
atom-gitlab | 0.1.3 | MIT | Yes | v0.1.3-7 | Yes |
duotone-stark-sea-syntax | 2.1.0 | MIT | No | v2.1.0-6 | Yes |
atom-selection-length | 0.1.0 | MIT | Yes | v0.1.0-4 | Yes |
duotone-bright-sea-syntax | 2.1.0 | MIT | No | v2.1.0-6 | Yes |
atom-symbols-view-plus | 0.118.2 | PD | Yes | v0.118.2-plus.3 | Yes |
atom-citron-formatter | 0.0.3 | None | No | v0.0.3-1 | No (0.0.3) |
build-ember | 0.0.1 | MIT | No | v0.0.1 | Yes |
mcfunction-lang | 1.18.2 | MIT | No | v1.18.2-build-2.2.0 | Yes |
require-anywhere | 2.0.7 | MIT | No | v2.0.7 | Yes |
typo3-fluid-atom-snippets | 1.6.0 | None | Yes | v1.6.0 | Yes |
zero-title-bar | 0.3.0 | MIT | No | v0.3.0 | Yes |
Before making the deletion script for non-latest versions, this is a recap of the packages that have latest duplicated versions.
I agree to contact the publishers, but at the same time I wonder if it's really needed to inform them. We could just set the latest released on their repo as the latest on our database. I mean, this is what they wanted in the first place. Marking the wrong latest was our issue while importing the atom database to the new backend (maybe it's not an issue for some of them which have their latest released as the latest on our side).
Anyway, if we have to contact them, @confused-Techie what do you think about doing it only for packages that are not archived?
About supporting alpha/beta versions, I will open a specific issue about the introduction of different channels. But I'd like to resolve this first.
Also, what did we agree about packages without licence? Should we contact the authors to recommend to move a free license? We discussed on discord, but I don't remember at the moment...
Update: added two columns after checking the latest releases.
So I checked through swagger on the /api/:packType/:packageName
and all the packages listed in the table have the latest release marked as latest on our beckend, except one: atom-citron-formatter
As you can see in the tags, v0.0.3 (latest in the database) is at the top, but v0.0.3-1 (second on the list) has been released one day after.
(And this is an example of why we need to sort the versions not relying on extensions)
So, substantially we only need to fix this package (which is also without license, I don't know if it 100% fit our policy).
Sorry for my lag. But to answer a few things:
atom-citron-formatter
since that'd be awesome if that's the only one that needs real interaction with the maintainer.Don't worry, take your time.
- As for what to do here, I'll go ahead and run your script as recommended, But so beyond deleting packages old versions, does this mean the only package we will actually have to modify is
atom-citron-formatter
since that'd be awesome if that's the only one that needs real interaction with the maintainer.
atom-citron-formatter
is the only package which its latest release does not reflect the latest status on our backend.
Please do not forget to make a backup before running the script for safety reasons.
Since there's #64 and no need to change the unique constraint anymore, here we discuss only about the cleanup of authstate codes.
@confused-Techie Did you make the cron job? Is it running regularly? At which interval?
Can you post the result of the following? Thanks.
SELECT id, created
FROM authstate
WHERE created < CURRENT_TIMESTAMP - INTERVAL '10 minutes';
I have not gotten around to creating the cleanup. But the code utilizing the new AuthState has not been pushed to production yet, since we still have the updates to version handling to work out. Once we get to a point where we can push those changes I'll start running the cleanup for the auth codes
@confused-Techie Can you show the following to ensure the cleanup is taking place so we can close this issue? Thanks.
SELECT id, created
FROM authstate
WHERE created < CURRENT_TIMESTAMP - INTERVAL '10 minutes';
@Digitalone1 the output of the above command is empty, returning 0 results.
This issue has been resolved via pulsar-edit/package-frontend#60
.
In that PR you can look at .config
to find the interval at which this task is scheduled
About #39, I open this issue to know about the cronjob of
authstate
table cleanup.@confused-Techie I know I could have asked on Discord, but here it's easy to keep track of changes. Which solution did you adopt to run the cleanup?
https://github.com/pulsar-edit/package-backend/blob/main/scripts/database/create_authstate_table.sql#L9-L10
Besides, did you have issues updating the
unique_pack_version
constraint?