progit / progit2

Pro Git 2nd Edition

What is a repository #1119

Open SimpleSamples opened 6 years ago

SimpleSamples commented 6 years ago

The page 1.3 Getting Started - Git Basics in the book defines terms and provides the most basic explanation of Git. Yet it says nothing about repositories. I see the term "database" a lot but I don't know if databases have multiple repositories or repositories have multiple databases. Also, it is not clear if the database is a specific file like in database systems such as MySQL or if the term database in this context is a more general term. That page seems to spend more time explaining what Git is not, in terms that people who are familiar with those other systems are likely to understand, than explaining some of the important basics such as what a repository is.

I hope that is a better question than what you probably thought when you saw the subject "What is a repository". I hope you understand the importance of providing further explanations of the basics that are already very obvious to everyone experienced with Git.

nobozo commented 6 years ago

I agree that this book is unclear about "repository" vs. "database". My advice is to think of these terms as meaning the same thing. You don't have to know what's inside a repository to be a skillful Git user, although it doesn't hurt. The last chapter of the book talks more about Git internals.

SimpleSamples commented 6 years ago

Git internals is not what I meant. I am talking about the book and what is relevant to the book. I meant that soon after that page the book talks about repositories without defining what a repository is. And most everything that says much about Git will use the term repository, correct?

Not everyone will understand the importance of defining terms that are subsequently commonly used, but authors who write introductory material must understand it. If the term database is used as a synonym for repository, then the book should say repository and possibly clarify what repositories are by saying they are databases, but only in a general sense, not relational databases.

I have tried to get a relevant definition of what a repository is and there seem to be inconsistencies in the definitions.

ben commented 6 years ago

This is one of those questions where the answer depends on how much the asker already knows. A repo is "a directory that contains a worktree and a .git folder", or "a database, and maybe an index and a worktree", or maybe "a directory where you do work on the files, but you can also tell Git to record snapshots of those files and communicate with other copies". It's all contextual.

This section was written wayyyy back in 2008, when Git was an underdog in the version-control world. Now that many (most?) programmers learn Git as their first tool, we could probably do a better job here.

If you have a recommendation on how to include this, I'd love to give you credit for writing it. Care to submit a PR?

SimpleSamples commented 6 years ago

Okay, Ben, I will try to stumble through revising the documentation.

It seems that very few authors understand the concept of being conceptual; readers sometimes do. See "What do these words mean in Git: Repository, fork, branch, clone, track?" on Stack Overflow; Daniel Stutzbach got 60 upvotes for saying without explaining what certain words mean or how git works and related comments.

Also in that Stack Overflow thread, nfm got credited for providing the answer saying "A repository is simply a place where the history of your work is stored.", implying that the files (commit objects) are not part of the repository. If the repository does not include the files then that means that after cloning a repository we must also get a copy of the files. I think that can make things confusing.

I have in the past been confused about how to use Git and GitHub in Visual Studio but it was not a priority for me. But now I want to attempt to make improvements to Microsoft documentation; there is abundant opportunity for improvement there. I really need to understand Git and/or GitHub to do that.

SimpleSamples commented 6 years ago

Oh, and I assume that the Git documentation will soon be a part of the Microsoft documentation.

ben commented 6 years ago

> Also in that Stack Overflow thread, nfm got credited for providing the answer saying "A repository is simply a place where the history of your work is stored.", implying that the files (commit objects) are not part of the repository.

You can also have a bare repository, which only includes what's in the .git directory. So a repo doesn't necessarily have a worktree or index, but it definitely has a HEAD and at least part of the history in the form of commits, trees, and blobs. This isn't a simple topic, as I'm sure you're discovering.

> Oh, and I assume that the Git documentation will soon be a part of the Microsoft documentation.

Microsoft bought GitHub, not Git. I'm not even sure it would be possible to buy Git. Anyways, the documentation (and this book) will remain safely in the main Git repository, on git-scm.com, and in this repository.

SimpleSamples commented 6 years ago

> Microsoft bought GitHub, not Git.

I should have known that. Yes, I was not thinking.

tloredo commented 1 year ago

I realize this is an old issue, but it's still open, and I think there is still confusion about what exactly comprises a repository. It seems to me that now (2023, ~3.5 years after this thread started) the docs are pretty explicit about what a repository is, and maybe it's worth spelling out (if only here), and also being more explicit about the concept of a Git project, a phrase that unfortunately has two meanings in the docs (but probably not in a confusing way).

The trigger for resurrecting this issue: Somewhere (I don't have notes from where) I learned that a Git repo is a folder containing a .git folder (with the index/stage and object database) and (typically but not necessarily) a working tree or working directory (the docs use both terms, though the former seems more precise). This is the definition @ben mentions above. I've taught it to my students. But I now think this definition is wrong.

Today I attended a Git training session at my university (Cornell U., held by the Center for Advanced Computing), just with an eye out for tips to help when I teach Git to my students. In that tutorial, the instructor identified the .git folder as the repository, i.e., not the folder one level higher that contains both this Git folder and the working tree.

The "What is Git?" page, towards the bottom, mentions a "Git project" as comprising

> the working tree, the staging area, and the Git directory

where the subsequent figure (resembling one I teach with) identifies "the Git directory" as the .git directory, and explicitly (albeit parenthetically) dubs that directory as the "Repository". So this distinguishes a Git project from a Git repository, the project including the working tree (and stage) along with the repo.

The git-init documentation page is quite explicit and consistent about referring to the Git directory (.git by default) as "the repository", including in the case of a bare repo, i.e., a directory not necessarily named .git that contains what is in the nominal .git directory.
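To make the distinction concrete, here is a minimal Python sketch of that reading: a directory counts as a "repository" if it looks like a Git directory, and as a "Git project" if it contains one named .git. The helper names (`looks_like_git_dir`, `classify`, `make_fake_git_dir`) are hypothetical, and the check for HEAD, objects/ and refs/ is only a heuristic assumed for illustration. It also ignores the case where .git is a file rather than a directory.

```python
import tempfile
from pathlib import Path


def looks_like_git_dir(path: Path) -> bool:
    """Heuristic only: a Git directory holds a HEAD file plus objects/ and refs/."""
    return (
        (path / "HEAD").is_file()
        and (path / "objects").is_dir()
        and (path / "refs").is_dir()
    )


def classify(path: Path) -> str:
    """Label a directory using the terminology discussed in this thread."""
    if looks_like_git_dir(path / ".git"):
        return "git project"   # working tree with a .git repository inside
    if looks_like_git_dir(path):
        return "repository"    # a bare repo, or a .git directory itself
    return "neither"


def make_fake_git_dir(path: Path) -> None:
    """Create just enough structure for the heuristic (not a usable repository)."""
    (path / "objects").mkdir(parents=True)
    (path / "refs").mkdir()
    (path / "HEAD").write_text("ref: refs/heads/main\n")


# Build a throwaway project and a throwaway bare repository, then classify them.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    bare = Path(tmp) / "project.git"
    make_fake_git_dir(project / ".git")
    make_fake_git_dir(bare)
    results = {
        "project": classify(project),
        "project/.git": classify(project / ".git"),
        "project.git": classify(bare),
        "tmp": classify(Path(tmp)),
    }

print(results)
```

Under this reading the folder you work in is the project, while both the .git directory inside it and a bare repo are repositories.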

Here's a Google search for uses of "Git project" as a phrase in the Git book site:

"Git project" site:git-scm.com/book - Google Search

There are a number of uses of "Git project" in the sense of the "What is Git?" page—as the combination of a repo (.git folder) and a working tree (and stage).

Unfortunately, "the Git project" is also used to refer to, well, the project that built and maintains the Git toolchain (and not just its repo with the Git code!). I think this distinct usage is pretty clear from context (esp. from the definite article, "the"), so it probably isn't of concern for the terminology question in this issue.

So I'm wondering if the way to be extra-clear about this would be to explicitly define the notion of a Git project in the book, e.g., just by adding a few words on the "What is Git?" page indicating a term is being formally defined when a Git project is first mentioned. Maybe just having it in italics would be enough of a signal to the reader, but having the word "repository" appear in the text (and not just the figure) as a part of a Git project would also be helpful.

It sure seems to me to be useful to have a recognized term for the combination of a working directory and a repo, since that's the "place" where most of us do all our Git work. Git project seems like the right term, already implicitly defined by usage in the book.

vsessink commented 1 month ago

@ben are you still into adding some terminology somewhere? How about a "terminology" paragraph right before the Git history? Here's a rough outline. It's not a PR yet because I think you'll probably want to discuss it first. I based the terms on the original question and on the Stack Overflow questions above. And I'm "asking this for a friend" (yeah really) who almost got upset when I began rattling on about a "remote" because she had no idea what that was. So explaining some basic terminology could help, IMHO. What do you think?

=== Some terminology ===

So far, discussing the various types of version control systems, we have come across a few terms that you will encounter throughout this book. We'll introduce them with a short description here. Please note that these aren't formal definitions or anything and we will use the terms somewhat interchangeably.

repository This is, roughly, the files of your project in all your versions, including the versioning information and any other files that are there to help Git keep track of your work. Because of Git's distributed nature, whenever you're collaborating on a project, your repository probably doesn't contain all versions and all files of all of your co-workers. But your repository will contain your files and your changes.

working tree Also called working copy or working area. While it's good to know that there are thousands of other versions of these files in your repository - your working tree is the specific version that you are currently working on.

commit Also sometimes called a check-in. Git doesn't automatically track everything you do, the way a word processor would when the "track changes" setting is on. You need to tell Git that you have prepared another version and want it recorded in your repository. This is called committing your work to the repository. Git then stamps your new work with a freshly generated commit-ID (see below) and you get to add a commit message. Git also records some extra information with your commit, such as the user name you told Git to use, the commit (actually: the commit-ID) it was based on, and the time and date.

SILLY ASCII ART AS A MOCK UP FOR a linear line of commits, with their short commit ID's
8c3926    64460f    5229f3    51d8fc    5d0023
commit -> commit -> commit -> commit -> commit

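commit-id Also known as commit hash, this is a long string of hexadecimal digits (40 characters with Git's default SHA-1 hashing) that forms the identity of a commit. Every commit has its own unique commit-ID. Because these IDs are so long, Git also lets you refer to a commit by any unambiguous prefix of its ID; the abbreviations Git prints itself are at least 7 characters long by default, though shorter prefixes, like the 6-character ones in the picture above, work as long as they are unambiguous.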

checkout Retrieving a specific version or commit from the repository into your working tree is called a checkout (the verb is "to check out"). This means that a checkout changes your working tree. Also note that if Jessica removed half of your work, a checkout of her version will also remove half of your work. But only in the working tree: as long as your work was committed, nothing is lost.

branch In the above picture, the chain of versions is a linear one: a long string of commits, like a timeline telling your commit history. But it doesn't need to be that way. You can create (and name) an alternate version history, based on one of the commits. A branch can thus be seen as a separate timeline that can co-exist with other timelines. You can use a branch to test or to try something - or you can even use a branch for a side project.

SILLY ASCII ART AS A MOCK UP FOR  a branched tree of commits
commit -> commit -> commit -> commit
       \                            ^
        V                          / a merge
         commit -> commit -> commit
      a branch            \
                           another branch

merge A merge is the inverse of a branch: this is when you combine two lines of history, such as two branches, into one. A merge commit thus has two (or more) parent commit-IDs.

clone cloning a project means copying the repository, which includes its version history and files.

fork a fork of your work is a clone that someone else is going to use for something else, like a fork in the road.
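The terms commit, commit-ID, branch and merge above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not how Git is implemented: `ToyRepo`, its methods, and the fixed author line are hypothetical, every toy commit points at the empty tree, and the merge records two parents without combining any file contents. The `object_id` helper follows Git's default SHA-1 object naming (a header of type and size, a NUL byte, then the body).

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """Name an object as Git's default SHA-1 format does: 'type size' + NUL + body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Every toy commit below records the same (empty) tree, to keep the sketch short.
EMPTY_TREE = object_id("tree", b"")


class ToyRepo:
    """Hypothetical model: commits are IDs with parents, branches are named pointers."""

    def __init__(self):
        self.commits = {}   # commit-ID -> list of parent commit-IDs
        self.branches = {}  # branch name -> commit-ID at the tip of that branch

    def commit(self, branch, message, extra_parents=()):
        parents = []
        if branch in self.branches:
            parents.append(self.branches[branch])
        parents.extend(extra_parents)
        lines = [f"tree {EMPTY_TREE}"]
        lines += [f"parent {p}" for p in parents]
        lines += [
            "author A U Thor <author@example.com> 0 +0000",      # made-up identity
            "committer A U Thor <author@example.com> 0 +0000",   # and timestamp
            "",
            message,
            "",
        ]
        commit_id = object_id("commit", "\n".join(lines).encode("utf-8"))
        self.commits[commit_id] = parents
        self.branches[branch] = commit_id  # the branch now points at the new commit
        return commit_id

    def branch(self, name, start):
        """A new branch is just another name pointing at an existing commit."""
        self.branches[name] = start

    def merge(self, into, other, message):
        """Record a merge commit whose parents are the tips of both branches."""
        return self.commit(into, message, extra_parents=[self.branches[other]])


repo = ToyRepo()
first = repo.commit("main", "first")
second = repo.commit("main", "second")
repo.branch("topic", second)              # alternate timeline starting at 'second'
side = repo.commit("topic", "side work")
third = repo.commit("main", "third")
merged = repo.merge("main", "topic", "merge topic into main")

print(merged[:7], "has parents", [p[:7] for p in repo.commits[merged]])
```

Because the parent IDs are part of what gets hashed, a commit-ID identifies not only one version but, indirectly, the whole history leading up to it.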

Some terminology that I didn't get to describe yet:

remote aka upstream (aka server?)

fetch

push