packages/ directory structure is too shallow

samoht commented 11 years ago

This is a follow-up on OCamlPro/opam-repository/issues/433:

Heya, while OCaml hasn't yet taken the world by storm (but hold your breath, as we're coming!), the directory structure of opam-repository/packages/ is, IMHO, already showing some limits. There are too many dirs in there. We are still far from that risk but, especially considering that each version consumes an entry there, we might end up with an overly large directory, which is, among other things, slow to ls and slow to show on a web page.

Precisely for that reason, other software distributions (e.g. Debian) tend to structure their package space using more deeply nested directory structures. In Debian, for instance, we use indexing based on the first letter of the (source) package name, special-casing the "lib" prefix, which tends to be statistically overcrowded. See http://ftp.fr.debian.org/debian/pool/main/ for an example.

Can you consider restructuring packages/ in a similar way?

I expect the "oc" prefix to be overcrowded among OCaml packages, so it might make sense to special case it, instead of "lib".
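The indexing scheme described above can be sketched as follows. This is only an illustration: the function names, the default "oc" prefix, and the `packages/<bucket>/<name>/<version>` layout are assumptions made for the example, not anything OPAM or Debian tooling provides.

```python
def pool_bucket(name: str, special_prefix: str = "oc") -> str:
    """Return the index directory for a package name.

    Follows the Debian pool idea: file a package under its first letter,
    except that names starting with an overcrowded prefix get a longer
    bucket made of the prefix plus the next character.
    """
    if not name:
        raise ValueError("package name must not be empty")
    if name.startswith(special_prefix) and len(name) > len(special_prefix):
        return name[: len(special_prefix) + 1]
    return name[0]


def pool_path(name: str, version: str) -> str:
    # Hypothetical layout: packages/<bucket>/<name>/<version>
    return "/".join(["packages", pool_bucket(name), name, version])
```

With these assumptions, `ocamlfind` would land in `packages/oca/` and `lwt` in `packages/l/`, so no single directory has to hold every package.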

Also, I suggest putting different versions in different subdirectories, as the number of available versions might grow indefinitely if you ever want to allow keeping old versions around indefinitely (which is not necessarily a good thing, but still).

avsm commented 11 years ago

We also need a trove to categorise packages by type. So instead of a lexical split of directories, how about organising them by packages/database/sqlite for the moment? The trove will still need tags as things can have multiple purposes, but it's nice to be able to quickly search by directory in the first instance.

Some way of moving older versions into an archival directory (where they are still available, just not in the 'forefront') might also be a good idea.

zacchiro commented 11 years ago

Based on my experience, I think that would be a bad choice. Trove categorizations tend to be more volatile than one might imagine at first. And many orthogonal categorizations can be imagined. Having to move stuff around whenever one changes one's mind about the ultimately perfect categorization would be annoying in the long run.

IMHO, it would be much better to have a "pool" of packages, organized as aseptically as possible, for the sole purpose of avoiding huge directories. Lexical categorization serves that purpose well.

Then, on top of it, we can add all sorts of external categorizations we want, as mere indexes that reference the lexical (and stable, and long-term) categorization.
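One way to picture such an external index is a plain mapping from tag to pool paths, built without moving anything in the pool itself. The sketch below is hypothetical: the function name, the input shape, and the example paths are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def build_tag_index(package_tags: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build a tag -> [pool path] index from a pool path -> [tag] mapping.

    The pool paths are only referenced, never changed, so recategorizing
    a package means editing its tags and rebuilding the index.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(package_tags):
        for tag in package_tags[path]:
            index[tag].append(path)
    return dict(index)
```

A package carrying two tags simply appears under both keys, which a single directory hierarchy cannot express without symlinks.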

Just my 0.02€.

avsm commented 11 years ago

You make a fair point. Directory categorisation works very well for BSD ports since users do often navigate and install directly from the filesystem structure. OPAM has much more complex constraints due to all the compiler switches, and so a longer-term solution will probably be a dpkg-style curses frontend that explains its decisions to the user more clearly. This points to a pool of packages also.

In terms of concrete changes to OPAM for the moment though, all we need is for the packages/ directory to support arbitrary nested sub-directories. I guess a sensible heuristic is to traverse every directory until an opam file is encountered within it, and then stop descending.
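The traversal heuristic just described can be sketched in a few lines. This is an illustration in Python rather than OPAM's actual OCaml code, and `find_package_dirs` is a name invented for the example.

```python
import os
from typing import Iterator


def find_package_dirs(root: str) -> Iterator[str]:
    """Yield every directory under `root` that directly contains an `opam` file.

    Once such a directory is found, its sub-directories are not visited.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        if "opam" in filenames:
            yield dirpath
            dirnames[:] = []  # prune in place: stop descending below a package
        else:
            dirnames.sort()  # visit siblings in a deterministic order
```

Because the walk stops at the first `opam` file on each branch, the directories above a package can be arranged in any way (lexical, by category, or flat) without the traversal code having to know about it.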

Another easy answer: for the archived packages, we should simply create an 'archive' remote that holds the older, rarely used packages. This could be added by default in the beginning. We need multiple remotes for the Platform anyway.

-anil


dbuenzli commented 11 years ago

I definitely think that making $NAME/$VERSION is a good idea, especially since some people do like to release often. I also think that it would be better to do this change before 1.0.

Indexing by package first letter, I think, depends a little on how opam-repository is to be managed in the long term. That is, does it strive for completeness (and thus maybe pointlessness), or does it perform some selection based on quality/usefulness? In the latter case, opam is perfectly able to support multiple repositories, and repositories are simple to publish, so not being in opam's main repository shouldn't be a big problem. Also, older packages could eventually be moved to opam-repository-oldies or whatnot. As such it may not really be necessary to index by package first letter (I prefer not to have to manually navigate deeply nested file hierarchies, even though that should not happen too often).

I really don't like @avsm's suggestion. This kind of hierarchical organisation doesn't work as soon as two categories apply; it may not match everyone's idea of where a thing should be, and thus makes it slower to find the actual package when you are looking for it. Better to have lists of tags in package descriptions and maintain a separate mechanism to search by tags. Hierarchical categories don't work.

The question, then, is what counts as too big a directory, so that we can find an appropriate tradeoff between easy manual navigation and technical issues.

Finally, one data point may be the Hackage statistics: ~5000 libraries. I'm only following that ecosystem from afar, but I doubt that there are ~5000 worthwhile libraries in there.

zacchiro commented 11 years ago

> In terms of concrete changes to OPAM for the moment though, all we need is for the packages/ directory to support arbitrary nested sub-directories. I guess a sensible heuristic is to traverse every directory until an opam file is encountered within it, and then stop descending.

Yep, absolutely. I wanted to make exactly the same suggestion :-) This way, the directory structure could evolve independently from the package manager code.

avsm commented 11 years ago

On 20 Feb 2013, at 09:49, Daniel Bünzli notifications@github.com wrote:

> I definitely think that making $NAME/$VERSION is a good idea, especially since some people do like to release often. I also think that it would be better to do this change before 1.0.

I'd prefer to keep the directory+version as it is right now for simplicity. See my alternative suggestion for archival via a different remote.

> I really don't like @avsm's suggestion. This kind of hierarchical organisation doesn't work as soon as two categories apply; it may not match everyone's idea of where a thing should be, and thus makes it slower to find the actual package when you are looking for it. Better to have lists of tags in package descriptions and maintain a separate mechanism to search by tags. Hierarchical categories don't work.

Actually, the BSD ports have maintained this very successfully over many years for 8000+ ports. They do of course support multiple categorisation, and symlinks can (optionally) make a port appear in multiple directories.

However, for the reasons outlined in my previous reply, I think a pool-style model may be more workable here.

-anil

samoht commented 11 years ago

Just to clarify:

There are 2 issues here:

1. how the repository structure should be organized (which is mirrored into ~/.opam/repo/<name>/package/...)
2. how OPAM should organize its internal state (i.e. ~/.opam/{opam,descr,archives}/...)

For 1, I'm sure we can find a heuristic to make everybody happy (by looking at the opam files). And 2 is very easy to change, and completely transparent to the user.

So the only "hard" technical point here is to come up with a good heuristic to replace https://github.com/OCamlPro/opam/blob/master/src/core/opamPackage.ml#L164. So it could be quite easy and will not break backward compatibility (so it's not really necessary to do this before 1.0).

dbuenzli commented 11 years ago

On Wednesday, 20 February 2013 at 10:55, Anil Madhavapeddy wrote:

> I'd prefer to keep the directory+version as it is right now for simplicity. See my alternative suggestion for archival via a different remote.

Yes, in fact it's better to keep it as it is now: if everything is traversed as you suggest, nothing prevents us from eventually organising things as $NAME/$NAME-$VERSION.

> > Hierarchical categories don't work.
>
> Actually, the BSD ports have maintained this very successfully over many years for 8000+ ports. They do of course support multiple categorisation, and symlinks can (optionally) make a port appear in multiple directories.

So effectively it's not a hierarchical system, which is a good thing.

But really I'm not sure that it is useful to reproduce the tagging system in the file system. One thing is that I'd really like to have that information in a tags: field in opam files and not have to put/symlink my package in different directories in order to tag it. Having alphabetical order in the repo seems more sensible to me.

> However, for the reasons outlined in my previous reply, I think a pool-style model may be more workable here.

Not sure I understand what you mean by that. For me a pool of packages is a repo.

Best,

Daniel

dbuenzli commented 11 years ago

@samoht Right, as a package maintainer, I'm mainly talking about "how the repository structure should be organized".

samoht commented 11 years ago

> One thing is that I'd really like to have that information in a tags: field in opam files

Done. opam info <package> -f tags to get the tags back.

samoht commented 11 years ago

I've tried experimenting with this feature request (see the more-flexible-repo-structure branch in my tree), but I'm not very convinced by the technical details yet. Basically, if you have a non-standard repository structure, then you have to "scan" the whole tree every time you want to know if a package is present in a repository (instead of just checking for packages/$name.$version/opam). So I guess I need to come up with a cleverer caching strategy or think about it a little more (which means it won't be in 1.0).

avsm commented 11 years ago

How about adding an archive remote to the default opam-init? That would let us migrate packages to archive without them just disappearing from the default opam-repository.

Then in the future, we could solve the problem of the archive repository getting too big, perhaps via a repo flag that would mark it as needing deep scans.


rdicosmo commented 11 years ago

What about checking for packages/$name/$version ? And then listing packages/$name should provide all available versions without needing to split filenames... avoiding any confusion which could arise if and when somebody comes up with a library named foo2.3 which is not version 3 of library foo2 :-)
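The ambiguity can be made concrete with a small sketch. It assumes, purely for illustration, that a flat `$name.$version` directory name is split at the first dot; the helper names are invented for the example.

```python
from pathlib import PurePosixPath
from typing import Tuple


def split_flat(dirname: str) -> Tuple[str, str]:
    """Split a flat `$name.$version` directory name at the first dot."""
    name, sep, version = dirname.partition(".")
    if not sep:
        raise ValueError(f"no version separator in {dirname!r}")
    return name, version


def split_nested(path: str) -> Tuple[str, str]:
    """Read name and version from the last two components of `.../$name/$version`."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2:
        raise ValueError(f"expected at least name/version in {path!r}")
    return parts[-2], parts[-1]
```

Under that assumption, `split_flat("foo2.3")` yields `("foo2", "3")` even if the author meant a library called `foo2.3`, whereas `split_nested("packages/foo2.3/1.0")` yields `("foo2.3", "1.0")` with no string splitting at all.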



samoht commented 11 years ago

> How about adding an archive remote to the default opam-init?

In my mind, we will always keep the default repository self-contained (ie. all packages should be installable without any archive repo). And as OPAM supports having packages with no associated repositories (which can happen when people remove a repository, or when a package is removed from the repo), I think we actually don't need to add an archive remote.

> What about checking for packages/$name/$version ?

I can indeed try to encode some basic policies (packages/$name.$version, packages/$name/$version, packages/$name/$name.$version) and have a special flag for the full scan. I'll try that today.
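A rough sketch of that idea: probe the three fixed layouts first, and only walk the tree when a flag asks for it. The function names and the `full_scan` parameter are hypothetical, chosen for this illustration; they are not OPAM's API.

```python
import os
from typing import List, Optional


def candidate_dirs(root: str, name: str, version: str) -> List[str]:
    """Return the three fixed layouts to probe, cheapest first."""
    nv = f"{name}.{version}"
    return [
        os.path.join(root, "packages", nv),             # packages/$name.$version
        os.path.join(root, "packages", name, version),  # packages/$name/$version
        os.path.join(root, "packages", name, nv),       # packages/$name/$name.$version
    ]


def locate(root: str, name: str, version: str, full_scan: bool = False) -> Optional[str]:
    """Find the directory holding the package's `opam` file, or None."""
    for directory in candidate_dirs(root, name, version):
        if os.path.isfile(os.path.join(directory, "opam")):
            return directory
    if full_scan:
        target = f"{name}.{version}"
        for dirpath, dirnames, filenames in os.walk(os.path.join(root, "packages")):
            if "opam" in filenames:
                dirnames[:] = []  # do not descend below a package directory
                if os.path.basename(dirpath) == target:
                    return dirpath
    return None
```

The common case costs at most three file checks, and the expensive walk only runs for repositories that opt in.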

avsm commented 11 years ago

> In my mind, we will always keep the default repository self-contained

Right, but I'm referring to the older versions of packages that are taking up all the space in the current directory tree. We really don't need all those old cstructs, do we? However, it would be nice to keep their description files around somewhere just in case someone needs a specific version, and a slower archive remote would work for that.

I agree that all the reasonably current versions should be in one repository.

Whichever route we take, it would be good to explicitly have a repository format recorded somewhere within the repo itself, rather than probing heuristics...

rdicosmo commented 11 years ago

On Thu, Mar 07, 2013 at 12:03:17AM -0800, Thomas Gazagnaire wrote:

> > What about checking for packages/$name/$version ?
>
> I can indeed try to encode some basic policies (packages/$name.$version, packages/$name/$version, packages/$name/$name.$version) and have a special flag for the full scan. I'll try that today.

Cool :-)

Roberto

samoht commented 11 years ago

I've finally managed to find a cheap way to encode this. So now, you can use any level of sub-directories in a repository for both the compilers/ and packages/ directories.

Remark1: $name/$version/opam is again ambiguous, so the right pattern is packages/XXX/.../YYY/$name.$version/opam.

Remark2: once this version is released, we need to decide what we do in opam-repository. Having packages/$name/$name.$version/opam and compilers/INRIA/4.00.1/4.00.1.{comp,descr} sounds quite sensible to me. We can also put all the base-* and conf-* packages in separate folders.
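To give an idea of what moving opam-repository to packages/$name/$name.$version would involve, here is a small planning sketch. It only computes the renames and touches no files. The function name is hypothetical, and it assumes package names contain no dot, so that the first dot separates name from version.

```python
from typing import Dict, Iterable


def plan_migration(flat_dirs: Iterable[str]) -> Dict[str, str]:
    """Map each flat `packages/$name.$version` to `packages/$name/$name.$version`.

    Assumption: the package name is everything before the first dot.
    Entries without a dot are skipped and left where they are.
    """
    moves: Dict[str, str] = {}
    for dirname in flat_dirs:
        name, sep, _version = dirname.partition(".")
        if not sep:
            continue  # not a versioned package directory
        moves[f"packages/{dirname}"] = f"packages/{name}/{dirname}"
    return moves
```

Since the leaf directory keeps its `$name.$version` name, the result still matches the packages/XXX/.../YYY/$name.$version/opam pattern from Remark1.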
