audrism opened this issue 6 years ago
The problem is caused by incorrectly freeing a null pointer at line 102. I'll fix the bug and commit a new version.
Thanks.
Btw, the current fetch does a full clone of the repo. In other words, it does not save on network traffic at all. What would help would be to fetch only the objects associated with new commits.
Hi, Professor! I ran a test of the fetch operation, adding some code to observe how many objects were fetched. First, I created an empty directory, then used git_remote_fetch to fetch from the remote, and received 21 objects. Then I modified a remote file and ran git_remote_fetch again; this time I received 3 objects. It looks like the fetch operation fetches only the objects associated with new commits.
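A sketch of such a test, assuming the current libgit2 API (the progress struct is named git_indexer_progress in recent releases and git_transfer_progress in older ones); the repo path and remote URL are placeholders:

```c
#include <git2.h>
#include <stdio.h>

/* Count objects received during a fetch via the transfer progress
 * callback. */
static int transfer_cb(const git_indexer_progress *stats, void *payload)
{
	(void)payload;
	printf("received %u of %u objects\n",
	       stats->received_objects, stats->total_objects);
	return 0;
}

int main(void)
{
	git_repository *repo = NULL;
	git_remote *remote = NULL;
	git_fetch_options opts = GIT_FETCH_OPTIONS_INIT;

	git_libgit2_init();
	opts.callbacks.transfer_progress = transfer_cb;

	if (git_repository_init(&repo, "./test", 0) == 0 &&
	    git_remote_create(&remote, repo, "origin",
	                      "https://github.com/KayGau/test.git") == 0)
		git_remote_fetch(remote, NULL, &opts, NULL);

	git_remote_free(remote);
	git_repository_free(repo);
	git_libgit2_shutdown();
	return 0;
}
```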
Great, that's how fetch works with a populated git database. But we have an empty git database whose objects are stored elsewhere. Is it possible to modify fetch so that it looks in an external database for the objects that do not need to be fetched?
I'm sorry, Professor, I don't get the point. To get new commit objects, we have to fetch from the remote repository. Then what does "look in an external database for the objects that do not need to be fetched" mean? Could you please describe it in more detail? I'm sorry......
Let's consider the example you have above:
1) "First, I create an empty directory" (an empty git repo, I presume?),
2) "then use the git_remote_fetch to fetch from remote, and I received 21 objects."
3) Now let's take all these 21 objects and store their hashes in some database, e.g., a flat file. You can use gitListSimp.sh to do that (see the sketch after this list).
4) Create another empty git repo.
5) Run fetch on it, but use the objects in the external database, so that this fetch receives only the 3 new objects, not all 24?
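A minimal sketch of step 3 using libgit2 itself, in case gitListSimp.sh is not at hand: it walks a repo's object database and prints one hash per line, which can be redirected into the flat file.

```c
#include <git2.h>
#include <stdio.h>

/* Print the sha1 of every object in the repo's object database,
 * one hash per line. */
static int print_oid(const git_oid *id, void *payload)
{
	char hex[GIT_OID_HEXSZ + 1];
	(void)payload;
	git_oid_tostr(hex, sizeof(hex), id);
	puts(hex);
	return 0;
}

int main(void)
{
	git_repository *repo = NULL;
	git_odb *odb = NULL;

	git_libgit2_init();
	if (git_repository_open(&repo, ".") == 0 &&
	    git_repository_odb(&odb, repo) == 0)
		git_odb_foreach(odb, print_oid, NULL);

	git_odb_free(odb);
	git_repository_free(repo);
	git_libgit2_shutdown();
	return 0;
}
```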
Hi, Professor!
The use case is that I have a database of over 11 billion git objects and would like to update it. In the considered example, all of the 21 objects obtained in the first step are imported into that database, but the repo /home/kgao/Desktop/test/.git is no longer there. I'd like to retrieve only the three new objects, not all 24.
Oh, I see the key point now. You have collected commits but have no local repository, so when I fetch, there is no local repo to compare with; I need to compare with the commits in the database. I thought you had a local repo…… Then I'll need to modify the fetch logic. I'll try!
Thank you. You can assume that any of the existing object hashes and content can be obtained from the database.
I traced the execution of fetch and found that the git_smart__negotiate_fetch function in smart_protocol.c tells the remote what we want and what we have. I modified it to tell the remote that we have the commits we got last time, but later, when execution reaches the git_smart__download_pack function in the same file, the fetch operation raises an error! I checked the output and found that the remote indeed knew we have local commits. Then I examined the function and found that when downloading from the remote, it first obtains the repo's odb and opens a writepack on it. The odb is the object database of the repo, which means that to correctly download the new commits, we must have a local repo...
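For reference, the "what we want / what we have" negotiation is a sequence of pkt-lines in the (v0) smart protocol: each line carries a 4-hex-digit length prefix counting the prefix plus the payload including its trailing newline, and 0000 is a flush packet. Roughly, with shas borrowed from this thread for illustration:

```
0032want 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
0000
0032have bd5cb845e25ec967486175475caad097287c7033
0009done
```

The modification amounts to emitting the have lines for commits taken from the external database instead of from the local refs.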
Yes, we need to have a local repo. The question is whether we just need to prepopulate it with the objects we have, or whether we can replace it with an alternative database, e.g., https://github.com/libgit2/libgit2-backends
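A custom ODB backend is the natural hook for the second option. A minimal read-only sketch, assuming recent libgit2 headers (the object-type enum was spelled git_otype in older releases) and placeholder lookups for the external store:

```c
#include <git2.h>
#include <git2/sys/odb_backend.h>
#include <stdlib.h>

/* Read-only backend that would answer object lookups from the
 * external database; the db_* calls are placeholders. */
typedef struct {
	git_odb_backend parent;
	/* handle to the external database would live here */
} external_backend;

static int ext_exists(git_odb_backend *backend, const git_oid *oid)
{
	(void)backend; (void)oid;
	/* return db_has_object(oid); -- placeholder */
	return 0;
}

static int ext_read(void **out, size_t *len, git_object_t *type,
                    git_odb_backend *backend, const git_oid *oid)
{
	/* Fetch content and type from the external database, copy into
	 * a buffer allocated with git_odb_backend_data_alloc(), fill
	 * *out/*len/*type; return GIT_ENOTFOUND when missing. */
	(void)out; (void)len; (void)type; (void)backend; (void)oid;
	return GIT_ENOTFOUND;
}

int add_external_backend(git_repository *repo)
{
	external_backend *b = calloc(1, sizeof(*b));
	git_odb *odb = NULL;
	int error;

	b->parent.version = GIT_ODB_BACKEND_VERSION;
	b->parent.exists = ext_exists;
	b->parent.read = ext_read;

	if ((error = git_repository_odb(&odb, repo)) < 0)
		return error;
	error = git_odb_add_backend(odb, &b->parent, 1 /* priority */);
	git_odb_free(odb);
	return error;
}
```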
OK. By the way, what type of database is used? mysql, redis, or sqlite?
It's a custom database based on tokyocabinet, since none of the above work at that size. You can assume an arbitrary interface to it. The difference from the git native object store is that you need to pass the repo url (to get the heads associated with a specific repo).
OK. Thank you, Professor!
Also, the type of the object needs to be passed along with the repo: tag/tree/blob/commit.
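For concreteness, the lookups against the external store might be declared like this; every name below is invented for illustration, since the real interface is arbitrary per the comment above:

```c
#include <git2.h>
#include <stddef.h>

/* Hypothetical interface to the external object store. The repo url
 * selects that repo's heads; the object type (tag/tree/blob/commit)
 * accompanies each content lookup. */
int db_get_object(const char *repo_url, const char *sha1_hex,
                  git_object_t type, void **content, size_t *size);
int db_get_heads(const char *repo_url, char ***sha1_hex_list, size_t *n);
```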
Hi, Professor! I wrote 4 source files these days to populate blob, tree, commit, and tag objects into the odb individually. Then I fetched from the remote. The result is still that it fetches all objects, including those we have populated into the odb...... It seems that populating the odb doesn't work.
Have you checked them in? So what is the difference between a fetch done on a naturally created repo and one done on a prepopulated repo? Have you set the heads properly? For example:
```
cat gather/.git/refs/heads/master
e01c6718c3b9264926886d6190b05bfa1d069167
```
When prepopulating the object store, the heads would not necessarily be set. Best would be to compare the exact differences between the two repos; that will give hints as to why fetch behaves differently.
I figured out the cause: when creating the first commit, Libgit2 won't create a branch, so I need to create the master branch manually. By the way, when populating a binary blob (such as an executable file) into the odb, libgit2 only offers the API git_blob_create_fromdisk() to read a file from the filesystem and write its content to the odb. So I am wondering whether a binary blob's content has to be stored in the database as a file?
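A sketch of both points: pointing refs/heads/master at a prepopulated commit manually, and writing a binary blob straight from a memory buffer so the content never needs to exist as a file on disk (git_blob_create_frombuffer is the pre-1.0 spelling; newer releases use git_blob_create_from_buffer):

```c
#include <git2.h>

/* Prepopulate a repo: create the master ref ourselves (libgit2 will
 * not do it for us) and write one blob from memory. */
int prepopulate(git_repository *repo, const git_oid *commit_oid,
                const void *blob_data, size_t blob_len)
{
	git_reference *ref = NULL;
	git_oid blob_oid;
	int error;

	/* force = 1 overwrites an existing master ref */
	error = git_reference_create(&ref, repo, "refs/heads/master",
	                             commit_oid, 1, "prepopulate");
	if (error < 0)
		return error;
	git_reference_free(ref);

	return git_blob_create_frombuffer(&blob_oid, repo,
	                                  blob_data, blob_len);
}
```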
Great! It checks .git/refs/heads/* for the latest commit on each branch and compares them to what is on the server. That way only the diffs need to be sent back.
Can you pre-populate the repo only with commits, or, preferably, only the last commit? Re-creating a large repo from the database may be time-consuming; perhaps you can reverse-engineer what happens when fetch is done on a non-empty master?
Indeed, populating all objects is time-consuming. I'll give it a try.
Hi, Professor! I tried populating only commit objects these days. I found that the function git_commit_create_with_signature() could help, so I modified its logic to populate only the commit object. It works: it can populate commit objects, and it can also populate only the last commit.
However, an error occurred similar to the one from when I earlier tried to modify the fetch logic to look in an external database for objects that do not need to be fetched. When I used the database that contains only the last commit (or all the commits) and fetched (git_remote_fetch) from the remote, a fetch error occurred. I checked .git/objects and found a damaged pack file. I repeated the experiment with the git fetch command and printed the fetch process, which showed that the remote indeed knew we have the last commit, but when downloading from the remote, it omitted some objects. These omitted objects are related to the objects missing from the database, such as the blob and tree objects.
Great. What happens is that first the additional commits are determined on the server, but in addition to commits, the new trees (and subtrees) associated with the new commits need to be sent. In order to determine what new trees and associated blobs need to be sent, a diff is done to calculate the delta. It seems that a part of the diff may be done on the client side or, more likely, the loose objects are combined into compressed storage (a pack file), and the prepopulated commit does not have its tree, hence the crash.
Can you track the attempts to access local objects during fetch, e.g., using trace or by adding prints to the relevant functions? Once such calls are identified, we can replace them with calls to an external database.
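One possible hook, assuming libgit2 was built with tracing enabled (the GIT_TRACE build option); otherwise prints in the odb lookup functions serve the same purpose:

```c
#include <git2.h>
#include <git2/trace.h>
#include <stdio.h>

/* Dump libgit2's internal trace messages to stderr. */
static void trace_cb(git_trace_level_t level, const char *msg)
{
	fprintf(stderr, "[trace %d] %s\n", (int)level, msg);
}

int enable_tracing(void)
{
	return git_trace_set(GIT_TRACE_TRACE, trace_cb);
}
```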
Oh, yes. I'll try to find the corresponding logic. Thank you for your hint!
Hi, Professor! When populating only the latest commit and fetching, an error occurred because of the lack of earlier objects. I modified the logic to resolve that and succeeded. Then another error occurred: missing delta bases. I printed out the missing object and found that it is the latest commit's tree. I tried populating the latest commit's tree into the odb, but it doesn't help. So I tried to figure out the reason. The fetch protocol downloads data, a packfile, which is binary data organized relative to earlier objects called "delta bases"; those delta bases are then used to decompress the packfile. So the error "missing delta bases" occurs when the odb lacks those earlier objects; only when all of those objects are in the odb does the error not occur. Thus, we must have access to the external database. So I want to ask whether the external database is complete, for example including binary files, pictures, and so on. If not, the error will happen again.
The external database contains the shas of all objects. It stores the content of all tags, trees, and commits, but only text blobs.
Can you stop fetch once it downloads the pack file? All the needed objects should be in the pack file, and the binary blobs don't need to be extracted. The pack file format is not too complicated; there is no need to use git to unpack it.
This talks a bit about the pack file format: https://git-scm.com/book/en/v2/Git-Internals-Packfiles
In other words, can you write a function that a) allows fetch to download the pack file and b) stops there?
I think that the packfile may contain all that is needed.
I get the point. I need to capture what the fetch operation receives from the remote. I think I can print the data to a file. But I am not sure whether the received data is consistent with the packfile format, so that needs to be confirmed. This describes the format in detail: https://mirrors.edge.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt I will then validate the received data's format, and if it matches, I will try to decompress it. This talks about that: https://codewords.recurse.com/issues/three/unpacking-git-packfiles
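Validating the saved data against the pack format can start with the fixed 12-byte header; a sketch:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Check the 12-byte pack header: the magic "PACK", a 4-byte
 * big-endian version (2 or 3), and a 4-byte big-endian object
 * count. Returns 0 on success and fills *nobjects. */
static int check_pack_header(FILE *f, uint32_t *nobjects)
{
	unsigned char h[12];
	uint32_t version;

	if (fread(h, 1, 12, f) != 12 || memcmp(h, "PACK", 4) != 0)
		return -1;
	version = ((uint32_t)h[4] << 24) | ((uint32_t)h[5] << 16) |
	          ((uint32_t)h[6] << 8)  |  (uint32_t)h[7];
	*nobjects = ((uint32_t)h[8] << 24) | ((uint32_t)h[9] << 16) |
	            ((uint32_t)h[10] << 8) |  (uint32_t)h[11];
	return (version == 2 || version == 3) ? 0 : -1;
}
```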
Try to save it first (it might actually save it on its own). I can help you with decoding if it is complicated.
I checked the library code, but I didn't find any code that saves it. When fetching, Libgit2 receives data from the remote; the data is stored in a struct git_pkt_data. Libgit2 receives repeatedly until all the data has been downloaded, and then decompresses it. The missing delta bases error occurs while decompressing. I tried writing the data from the git_pkt_data struct to a file and succeeded. Is this OK?
If you have an example of a trace for a simple repo, and the code, can you share them?
```c
typedef struct {
	enum git_pkt_type type;
	int len;
	char data[GIT_FLEX_ARRAY];
} git_pkt_data;
```
The fetch protocol distinguishes three packet types here: GIT_PKT_PROGRESS, GIT_PKT_DATA, and GIT_PKT_FLUSH. I only care about the GIT_PKT_DATA packets, so whenever a GIT_PKT_DATA packet is received, I write its data into a file; in the end, all GIT_PKT_DATA packets' data ends up in one file. I used my own repo to test: https://github.com/KayGau/test.git I populated a commit object whose sha1 is bd5cb845e25ec967486175475caad097287c7033. The received data is written to the following file: https://github.com/KayGau/libgit2/blob/master/packfile
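The capture itself is small; a sketch of how it might sit inside the packet loop of git_smart__download_pack, where pkt is the received git_pkt and packfile_out is a FILE* opened elsewhere (both assumptions):

```c
/* Append the payload of each data packet to a side file before it
 * is handed to the indexer. */
if (pkt->type == GIT_PKT_DATA) {
	git_pkt_data *p = (git_pkt_data *)pkt;
	fwrite(p->data, 1, p->len, packfile_out);
}
```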
What is the format of GIT_PKT_DATA? I.e., what struct/union describes it? I'm on my phone right now; it's hard to search the code...
GIT_PKT_DATA is a kind of git_pkt_type, an enum type:
```c
enum git_pkt_type {
	GIT_PKT_CMD,
	GIT_PKT_FLUSH,
	GIT_PKT_REF,
	GIT_PKT_HAVE,
	GIT_PKT_ACK,
	GIT_PKT_NAK,
	GIT_PKT_PACK,
	GIT_PKT_COMMENT,
	GIT_PKT_ERR,
	GIT_PKT_DATA,
	GIT_PKT_PROGRESS,
	GIT_PKT_OK,
	GIT_PKT_NG,
	GIT_PKT_UNPACK,
};
```
git_pkt_type is a property of git_pkt_data.
Libgit2 first receives data from the remote in a struct git_pkt:
```c
typedef struct {
	enum git_pkt_type type;
} git_pkt;
```
Then it checks the git_pkt's type field; if the type is GIT_PKT_DATA, Libgit2 casts it to git_pkt_data:
```c
git_pkt_data *p = (git_pkt_data *) pkt;
```
Maybe you can also write the .idx file? Also, where is the code that writes the packfile?
I checked the code, but Libgit2 receives the data without dealing with .pack or .idx files directly; that is, the received data (in the git_pkt_data struct) is the raw stream from which both the .pack and .idx files are later produced. Libgit2 stores all the received data in another struct and decompresses it, but because of the lacking base objects, an error occurred. The relevant code is in src/transports/smart_protocol.c, in the git_smart__download_pack function.
I made a slight mod: can you rerun and check in both packfile and packfilei?
I checked the current packfile, but it does not appear to be in the format described here: https://codewords.recurse.com/issues/three/unpacking-git-packfiles
For example: the current packfile starts with PACK version=2 type=1 len=1067, but the chunk after that does not decompress.
For this case, the packfile is the same as the former one. The packfilei's content is: 8191 530 1. Besides, in order to test whether our method (that is, populating only the latest commit object and fetching) will correctly download the data, I used the earlier successful but time-consuming method (populating all the objects, including blobs, trees, and commits); it downloads the same data.
OK, so the repo is https://github.com/KayGau/test.git
Pre-populate an empty repo with bd5cb845e25ec967486175475caad097287c7033
Do a fetch.
Right?
Yes. It downloads correctly, the same as when populating all the blob, tree, and commit objects; the packfile and packfilei are the same in both cases. In the latter case, it correctly downloads the data and decompresses it into the .pack and .idx files. I used the following command to unpack the .pack file; the output is in the attached file unpack.txt.
What I am saying is that the packfile is not correctly formatted:
```
git verify-pack -v packfile
packfile.pack: bad
```
I have added a dump of objects from https://github.com/KayGau/test.git
The format of a line in the idx file is:
```
offset;length;length when uncompressed;sha1;project
8;15;35;23ebb771c634697086b54b29b1572a51be307364;test.git;raw
```
The format of the bin file is simple concatenation. To get the object above: fseek(8), then fread 15 bytes; that's the exact content of the object. If you prepend "blob 35\0" to the content and compute the sha1, it will be 23ebb771c634697086b54b29b1572a51be307364.
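A sketch of that read, with an assumed file name; note that the idx line records a stored length of 15 but an uncompressed length of 35, so the stored bytes presumably still need to be inflated before prepending "blob 35\0" and hashing:

```c
#include <stdio.h>

/* Pull the example object out of the concatenated bin file using its
 * idx line "8;15;35;23ebb771...;test.git": seek to offset 8, read
 * 15 bytes. */
int read_object(void)
{
	unsigned char buf[15];
	FILE *f = fopen("test.bin", "rb");
	int ok = f && fseek(f, 8, SEEK_SET) == 0 &&
	         fread(buf, 1, sizeof buf, f) == sizeof buf;
	if (f)
		fclose(f);
	return ok ? 0 : -1;
}
```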
The git verify-pack -v command reads the .idx file, so this command fails.
I read a tutorial on the internet: https://shafiul.github.io//gitbook/7_the_packfile.html In this example, it shows that the object size is stored little-endian-style, so the first object's size is 696, not 1067, which is exactly the first commit's size. By the way, this object size is the expanded data's size, not the compressed data's size. I think we can try again.
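That matches the pack entry header encoding: the object type lives in bits 4-6 of the first byte, the low 4 bits are the least-significant size bits, and each continuation byte (MSB set) adds 7 more bits, least-significant group first, which is why reading the bytes big-endian gives a wrong size. A sketch of the decoder:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one pack entry header starting at buf[0]; fills the object
 * type and the expanded (uncompressed) size, and returns the number
 * of header bytes consumed. */
static size_t decode_entry_header(const unsigned char *buf,
                                  int *type, uint64_t *size)
{
	size_t off = 0;
	unsigned char b = buf[off++];
	int shift = 4;

	*type = (b >> 4) & 0x07;
	*size = b & 0x0f;
	while (b & 0x80) {
		b = buf[off++];
		*size |= (uint64_t)(b & 0x7f) << shift;
		shift += 7;
	}
	return off;
}
```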
I will check the code again. I think maybe the library processes the received data; for example, this code in smart_protocol.c:
```c
error = writepack->append(writepack, p->data, p->len, stats);
error = writepack->commit(writepack, stats);
```
These two functions are in the src/odb_pack.c file, corresponding to the pack_backend__writepack_append and pack_backend__writepack_commit functions. Those two functions in turn just call two functions in the src/indexer.c file: git_indexer_append and git_indexer_commit. They are very complicated, but I will try to comprehend them.
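For orientation, the same machinery can be driven directly through the public git_indexer API; a sketch following the libgit2 1.x signatures (older releases pass a progress callback pair to git_indexer_new instead of an options struct):

```c
#include <git2.h>

/* Feed raw pack bytes into a git_indexer -- the same path that
 * pack_backend__writepack_append/commit end up calling. Writes
 * pack-<checksum>.pack and .idx under dir; the delta resolution in
 * git_indexer_commit is where "missing delta bases" arises. */
int index_pack_bytes(const char *dir, const void *bytes, size_t len)
{
	git_indexer *idx = NULL;
	git_indexer_progress stats = {0};
	int error;

	error = git_indexer_new(&idx, dir, 0644, NULL, NULL);
	if (error < 0)
		return error;

	if ((error = git_indexer_append(idx, bytes, len, &stats)) == 0)
		error = git_indexer_commit(idx, &stats);

	git_indexer_free(idx);
	return error;
}
```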
Yes it works. Can you create a program that does this kind of fetch (argument provides packfile name)?
OK, here is what's needed:
- parameters: git url and the name of the output file (packfile)
- standard input: a list of commit shas (from refs/heads)
- output: packfile
OK! I'll try~
Yes, all the head refs. They need to be sent to the remote so that fetch gets only the diff.
Assume that the content of these commits can be read from a file.
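A skeleton of that tool under the stated interface; the numbered comments are placeholders for the pieces worked out above, not an implementation:

```c
#include <git2.h>
#include <stdio.h>

/* argv[1] = git url, argv[2] = output packfile name; stdin carries
 * the head commit shas. */
int main(int argc, char **argv)
{
	char sha[41];

	if (argc != 3) {
		fprintf(stderr, "usage: %s <git-url> <packfile>\n", argv[0]);
		return 1;
	}
	git_libgit2_init();

	/* 1. init an empty repo and a remote for argv[1] */
	/* 2. for each head sha on stdin, fabricate the commit object in
	      the odb (content from the external database) and point a
	      ref under refs/heads/ at it */
	while (scanf("%40s", sha) == 1)
		; /* prepopulate_commit(sha); -- placeholder */
	/* 3. fetch, appending GIT_PKT_DATA payloads to argv[2] and
	      stopping before delta resolution */

	git_libgit2_shutdown();
	return 0;
}
```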
```
echo tst,https://github.com/ssc-oscar/tst,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7 | /usr/bin/get_new_commits
path: tst
url: https://github.com/ssc-oscar/tst
new head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
old head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
no update!
Segmentation fault
```