prof-milki opened this issue 3 years ago
@thorwhalen seems active on GitHub; we would really appreciate it if he would take the time to remove these packages himself if this was an error.
Is anyone ever closing tickets here? To me this looks like a good reason to block this user: spam and/or squatting, but clearly not normal usage.
@pypa/warehouse-admins
@thorwhalen, what's going on here? It seems like you're republishing existing libraries under these short project names? https://github.com/thorwhalen/uu
First of all, I apologize: I noticed this thread only recently, when I noticed that some things weren’t working anymore because some of my names were taken away and looked into what happened. I now see I received email notifications of the thread, but they were buried in the plethora of github emails I get. I hope I didn’t inconvenience anyone two (pun not initially intended) much.
I didn’t answer right away because I immediately looked up similar issues, and seeing how subjective, and sometimes trolling, the whole thing could be, I was hesitant to step into that time-consuming and possibly emotional pit. This is why there are things like PEP 541 — to try to give a basis to the madness, though those who want to fight still do so over interpretation.
That said, let me shyly give this a chance.
Hopefully a bit of a declaration of intent can help shed some light on the behavior and be sufficient to appease and resolve this.
The effort stems from two major needs/wants. In a nutshell:
Over the years, when I spot something in a project that is not private or particular to the project, I try to extract the reusable functionality and put it up on public GitHub. It makes it easier to reuse in other projects, collaborate with their teams, etc. The way I was doing this wasn’t sustainable: one humongous repository with hundreds of sometimes totally unrelated functionalities, no packaging, no dependency management, etc.
I’ve been wanting to tackle this ugly problem for a while, and a year and a half ago I had a bit of time to look into it. I wanted to break things up in such a way that future projects didn’t have to clone and install a mammoth, but rather just pip install only what was needed. I’m also hoping that having things more broken up and themed will give me (and whoever I’m working with at the time) a specific home for utils that will be developed in the future, thus allowing even the little modules I have to snowball into something more significant.
Know that quite a bit of manual work went into it, in case automation is the issue here. If it’s the “shouldn’t be too fast” part of automation that is the critique, know that I spent extra time, in fact, to automate some of the manual work I had already done. This is because I felt that this “project outgrows itself” problem is common enough (to me) that I’d like to achieve my concrete goal leaving some tooling crumbs on the way.
Again, this is not me targeting a bunch of names and slapping code into them. This is me taking a big corpus of code that already existed, and was already broken up into packages, subpackages, and modules, and making these into fine-grained pip-installable packages.
Let me, now, address some of the constructive critiques I see above.
Yes. But it’s part of the point. Bulk process the big corpus and then bulk publish. It’s really how I’d like to do it, so that I can make improvements globally and consistently.
For example, in my todos are things like: automate `install_requires`, documentation, and creation of keywords (to better search this mess!); get more naming and typing consistency; etc.
The way I’d like it to go is: Create the tools to process the code, apply globally, and re-publish changes.
Is there some “nice” parameter I should include in my publishing process (like wait 5s between each package)?
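As far as I know, twine itself has no built-in delay or rate-limit option, so a throttle would have to be scripted around it. A minimal sketch (the dist paths, delay, and helper name are my own assumptions, not actual tooling from this thread):

```python
import subprocess
import time

def publish_throttled(dist_paths, delay_s=5.0, run=subprocess.run):
    """Upload each distribution via twine, pausing between uploads.

    `run` is injectable so the command construction can be exercised
    without actually calling twine.
    """
    commands = []
    for i, path in enumerate(dist_paths):
        cmd = ["twine", "upload", path]
        commands.append(cmd)
        run(cmd, check=True)
        if i < len(dist_paths) - 1:  # no need to sleep after the last one
            time.sleep(delay_s)
    return commands
```

Whether such a pause actually matters to PyPI's infrastructure is a question for the maintainers; this only shows how it could be bolted on.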
Hm. There's a lot of scope for subjective discussion here. I probably should avoid getting into that.
I'm guessing the "None" was an exaggeration, so I won’t take it as an insult to my intelligence or choices.
I will concede though that some of them are way too minimal and should probably be merged “out” into some other related package, or I should find some other relevant code to merge “in” to make the package more significant.
Yep. I agree with that. The `install_requires` one is my top todo!
Right now, what I did was just take the module doc and copy it into the README.md. But please note that I had to go through more than 100 modules and write a short doc manually.
Also note that quite a few functions (the main ones that I use most) have docs and doctests in them. So it’s documented in that sense. What I intend to do is extract those to make a more useful README and eventually (if the size and frequency of use of the package warrants it) a proper documentation site hosted on github.io.
Yeah… that’s one I hesitated on for some time. PyPI suggests that package names should be “short and memorable”. I’d prefer names that are both short and descriptive, and genuinely tried to get better names, but I was going down a rabbit hole there (an interesting one involving semantic analysis etc., but I won’t go into detail here). I favor short over descriptive and just couldn’t get (available) descriptive names that were short enough.
The saving grace for me was to realize three things:
I saw that there were a lot of two-letter packages available, so that sealed the deal.
Please also note that I did spend some time mapping the names to the packages. I didn’t have to, but I still did. Now it’s FAR from perfect (I regret some choices now, but it’s a bit hard to change without being disruptive), but many DO have a link only known to me. Here are a few examples, from the pretty good to the pretty far-fetched:

- `rv` - Utils to work with randomness - R.V. stands for Random Variable, a fundamental concept in statistics
- `cr` - Binary classification count results - C.R. = Count Results
- `un` - Analyzing classes - “un” is the indefinite article in French (so as in… AN instance of…)
- `kw` - Utils to work with storage - K for key and W for write, because in storage we often write something somewhere, referenced by a key

I just came up with another TODO: I should probably add these “name rationale” explanations in the READMEs.
So, hopefully I’ve demonstrated that this is neither a shameful squatting of names nor a rogue experiment, and we won’t have to go into the weeds of opinions.
It is FAR from perfect, and I’m still trying to figure out how to achieve what I want to achieve, but I tried to leave it in a fairly clean state when I first published these, and have recently made some more improvements.
I should add that the original comment was observational/speculative, because the package purposes weren't deducible from the absent documentation. And while the compartmentalization and convenience aspect explains the names, in this quantity it looks gluttonous.
Terse package names are fine (though `if` seems unimportable, being a Python keyword). The recommendation for terseness comes with caveats, though: names should still be proportional to generality, and some of these packages are rather domain-specific. Hence one should recognize that the PyPI namespace is a bit more uniform than the heavily segmented GitHub allocation.
May I suggest you consider reshuffling the majority into subpackages? Significantly more work to set up, but the pip extras mechanism might be the way to go here:
`pip install tcw[is,da,pk,gd]`
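As a sketch of that extras idea: one umbrella distribution whose optional extras stand in for the individual two-letter packages. The name `tcw` and the (empty) dependency lists below are placeholders, not real metadata:

```python
# Each extra names the dependencies that the former standalone package
# would have pulled in; empty lists are placeholders here.
EXTRAS = {
    "is": [],
    "da": [],
    "pk": [],
    "gd": [],
}

# With setuptools this would be declared roughly as:
#   from setuptools import setup
#   setup(name="tcw", extras_require=EXTRAS, ...)
# after which `pip install tcw[is,da]` installs just those extras.
```

The main trade-off is that extras select optional *dependencies*, not optional *code*: the umbrella package's own modules all ship together, which is part of why the author objects below that this doesn't suit minimal-footprint installs.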
I would like to join the complaint.
I don't want to accuse anyone of deliberately obstructing other developers. But when someone occupies ~12% of all two-Latin-letter combinations, it is hard not to see provocation in such behavior.
Especially since the packages are not well documented. And not maintained? (last update: 21./20. October 2022)
@thorwhalen Shouldn't "uu" + subpackages be enough? (similar to what @prof-milki mentioned) (https://github.com/thorwhalen/uu)
Therefore, please revoke all claimed names. @thorwhalen has had enough time to solve this issue. This ticket was opened on Jan 8, 2021, 3 years ago!
Dear moderators,
Can we either close this issue, or figure out how to resolve it? I'll respond more at length as soon as I get a chance (this week hopefully). Namely, why the solution proposed above isn't really one for me (or at least, it would be a pain). I'd also like to get a sense of clear constraints (e.g. more README.md? individual GitHub projects? more regular updates?) under which to find a solution (and hopefully one that's not too time-consuming or disruptive). I sought guidance in PyPI's terms and conditions and found I wasn't in violation at all. I also poked around the existing PyPI projects and found that the projects contested here are far from below average in "quality".
As for the comment by h-a-d-o, I implore you to cast it aside with the gentle indifference it deserves. The sole contribution of h-a-d-o to the GitHub cosmos appears to be the comment in question, as evidenced by their GitHub ghost town: https://github.com/h-a-d-o.
It seems to me to be a comment not born out of engagement or care, but rather the digital equivalent of shouting into the void. And so, with a slightly raised eyebrow and a hint of a smirk, we move on.
Lots of blablabla.
Ghost town like your mostly undocumented public repos? 💤
Just because there aren't any public repos doesn't mean much. ("Insider tip": in 2024 you are allowed to have several accounts on several git-hosting platforms 🤣)
@thorwhalen Why don't you just set up a private PyPI package repo for your packages/modules? GitLab offers this in a very convenient way.
@prof-milki: The solution involving "optional features" is not well-suited to our needs for several reasons beyond the initial investment required to implement it effectively. The drawbacks of this method are widely recognized and I choose not to delve into them to avoid provoking further negative responses from anonymous online commentators. Our specific situation, often dealing with edge computing and microservices, necessitates a focus on minimizing resource use, dependencies, and installation times.
I have detailed my methodology in a previous discussion, available here. As mentioned there, I try to separate project-specific private aspects from public, publishable reusable code, and have always encouraged my team and others around me to do the same.
To enable this, it's important to have package names you can build on. I tried to only publish to PyPI once a package was mature enough, but this repeatedly led me to run into taken names (some that I had even checked on before choosing the project name).
When this happens, it can cost quite a lot to transition to a free name. The code imports aren't the biggest problem: the difficult part is that all communication around the code (docs, emails, Slack messages) uses the old name. It's enough to discourage the publication effort.
It's at that point that I looked at what the rules of PyPI were, and what the general "quality" of the packages out there was. I found that in fact the bar was quite low, both in rules and in practice. Many packages are truly old squats. I have now gathered and computed actual stats on this. It became clear that I couldn't wait for a package to be completely ready to publish it, but also that this constraint was neither required nor practiced in general.
When it came to subdividing the pile of utils I had accumulated and finding a namespace for them, two-letter combos seemed like a good choice -- in part to reduce the likelihood that I'd be depriving others of an excellent name for a high-quality package.
Although the only significant feedback I have gotten since I explained my rationale has been the unwelcome attention of internet trolls, with no other major complaints, I request the closure of this issue to prevent further hostility. This unnecessary aggression has already resulted in considerable wasted time and energy.
tl;dr:
(Stats computed on a random sample of 10K pypi packages.)
The time I wasted recently on this was not spent updating the "uu" packages in question, but trying to determine a bit more scientifically what my target "quality" should be, by computing and comparing the hard numbers. Of course, the trolls out there will still complain and point out exactly those stats that I've missed, but no one can say I haven't thoroughly looked into it and responded (@di -- perhaps you can remove the "awaiting response" tag at this point?).
My first attempt was to write a script that would actually install the packages and carry out some operations on them.
I wanted more precise information than just the JSON metadata that can be found at https://pypi.org/pypi/{pkg}/json, but I realized that this was not only resource-intensive, but a bit dangerous.
Still, I'll convey the one stat I've computed there that cannot (as far as I can see) be derived from the json metadata:
These stats compare the 78 "uu" packages with a random sample of 100 packages from (the 500K+) packages of pypi:
uu installable: 100.0% (78 out of 78)
random pkgs installable: 34.0% (34 out of 100)
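For context, the JSON-metadata-based stats that follow could be gathered with something like this minimal sketch (the function names are mine, not the author's script; a failed fetch returns `None`, which is how "taken but no JSON" names would show up):

```python
import json
from urllib.request import urlopen

def pypi_json_url(pkg: str) -> str:
    """URL of PyPI's per-project JSON metadata endpoint."""
    return f"https://pypi.org/pypi/{pkg}/json"

def fetch_metadata(pkg: str):
    """Return the parsed JSON metadata for `pkg`, or None if the name
    is registered but serves no JSON (or the request fails)."""
    try:
        with urlopen(pypi_json_url(pkg)) as resp:
            return json.load(resp)
    except Exception:
        return None
```

The returned dict carries an `info` block (summary, license, description, etc.) and a `urls` list describing the files of the latest release, which is what the size and age proxies below rely on.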
And now the stats I computed from a random sample of ~10K package names taken from https://pypi.org/simple/. Note that 31 of these didn't have any JSON (so they are names that are "taken", but not usable). Still, I've used 9969 as the count when computing percentages.
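Drawing that sample could look like this (a sketch; the simple index is one anchor tag per project, so a regex suffices here, though a real script might prefer `html.parser`):

```python
import random
import re

def sample_names(simple_index_html: str, k: int, seed: int = 0):
    """Sample k package names from the HTML of https://pypi.org/simple/.

    Each project appears as an <a> element whose text is the name;
    a fixed seed keeps the sample reproducible.
    """
    names = re.findall(r"<a[^>]*>([^<]+)</a>", simple_index_html)
    return random.Random(seed).sample(names, k)
```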
What percentage of some key metadata fields are present?
Attribute | uu_pkgs | random_10K_pkgs |
---|---|---|
version | 100.00 | 99.99 |
summary | 100.00 | 94.05 |
home_page | 98.72 | 76.17 |
project_url | 100.00 | 99.99 |
license | 100.00 | 64.75 |
description | 100.00 | 80.43 |
size | 100.00 | 98.18 |
upload_time_iso_8601 | 100.00 | 98.18 |
As a proxy for this, I looked at the package size of the current version's release (taking the `sdist`, or the `bdist_wheel` if the `sdist` is not present).
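That sdist-preferring size lookup against the JSON metadata could be sketched as (the helper name is mine):

```python
def release_size(pkg_json: dict):
    """Size in bytes of the current release: the sdist if one was
    uploaded, else the first bdist_wheel, else None.

    `pkg_json` is the dict served by https://pypi.org/pypi/<pkg>/json,
    whose "urls" entry lists the files of the latest release with
    their "packagetype" and "size" fields.
    """
    files = pkg_json.get("urls", [])
    for kind in ("sdist", "bdist_wheel"):
        matches = [f for f in files if f.get("packagetype") == kind]
        if matches:
            return matches[0]["size"]
    return None
```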
Description | Size (bytes) |
---|---|
uu median package size | 10656 |
random 10K median package size | 9631 |
"worst" of the uu
packages to the rest of the world:
32.67% of the random 10K packages have a size smaller than the smallest uu package (which has 4848 bytes)
Description | Median Last Update Age (days) |
---|---|
uu median last update age | 468 |
random 10K median last update age | 834 |
"worst" of the uu
packages to the rest of the world:
62.49% of the random 10K packages are older than the oldest uu package (which is 468 days)
JSON metadata proxies: README.md (a.k.a. "description") and the short summary.
Description | uu | random 10K |
---|---|---|
Percentage of empty summaries | 0.00% | 5.95% |
Percentage of empty descriptions | 0.00% | 19.57% |
Median description size (characters) | 68 | 822 |
Description | uu | random 10K |
---|---|---|
Percentage of packages with license | 100.00% | 98.77% |
@prof-milki: I hope it is clear from the above that the `uu` packages' metrics are in fact consistently above average. The only one where they fall short is the README.md size, which I'll work on (I've already started, but didn't want that to affect the stats calculation).
It seems the only valid reason this issue even exists is that there are many of them, with strange two-letter names that make the whole thing stick out. I've explained why that is, so I hope this is sufficient information for you to close this issue.
Project to be claimed

`PROJECT_NAMES`: fk, fg, dn, cw, bj, au, ge, ju, ef, ix, yw, if, hz, vd, nm, bh, wv, aw, ji, uy, hg, nw, og, ij, rh, ek, ke, an, yb, ir, ub, ov, nj, ho, oa, ej, hf, yx, oj, yz, ul, xv, mv, xl, ow, xa, ou, eu, zr, iy, el, hm, jo, yv, ys, yp, yl, yi, yf, ya, uw, uo, un, uj, ug, uf, tu, tn, su, rv, ps, nh, na, lv, lh, lb, ky, kw, kr, kc, jy, ba, ha, by

`USER`: https://pypi.org/user/tcw/

Reasons for the request

This looks like some fairly automated registrations. None of the packages are overly interesting at first glance (nor well documented, for that matter). And the two-letter project names have little resemblance to the source repository or files they package.
Now this might be the result of some packaging script gone awry. But it does look a bit like very premature project name reservations. And there are obviously some names that might fit more useful projects (`ps` or `hg` at least).

Maintenance or replacement?
No transfer required.
Contact and additional research
No response so far.
https://github.com/thorwhalen/ut/issues