ssc-oscar / libgit2

The Library
http://libgit2.github.com

get_new_commits crashes #2

Open audrism opened 6 years ago

audrism commented 6 years ago

echo tst,https://github.com/ssc-oscar/tst,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7 | /usr/bin/get_new_commits
path: tst url: https://github.com/ssc-oscar/tst
new head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
old head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
no update!
Segmentation fault

KayGau commented 6 years ago

The problem is caused by incorrectly freeing a null pointer at line 102. I'll fix the bug and commit a new version.

audrism commented 6 years ago

txs.

audrism commented 6 years ago

Btw, the current fetch does a full clone of the repo. In other words, it does not save on network traffic at all. What would help is fetching only the objects associated with new commits.

KayGau commented 6 years ago

Hi, Professor! I ran a test of the fetch operation and added some code to observe how many objects were fetched. First, I created an empty directory and used git_remote_fetch to fetch from the remote, and received 21 objects. Then I modified a remote file, used git_remote_fetch again, and received 3 objects. So it looks like the fetch operation only fetches objects associated with new commits.
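
For reference, the test roughly looked like this (a minimal sketch using the public libgit2 API, not my exact code; the scratch path and URL are just examples, and error handling is omitted):

#include <git2.h>
#include <stdio.h>

int main(void)
{
    git_repository *repo = NULL;
    git_remote *remote = NULL;

    git_libgit2_init();
    /* 1. create an empty repository in a scratch directory */
    git_repository_init(&repo, "/tmp/fetch-test", 0);
    /* 2. fetch everything the remote advertises */
    git_remote_create(&remote, repo, "origin", "https://github.com/ssc-oscar/tst");
    git_remote_fetch(remote, NULL, NULL, NULL);

    /* the remote keeps transfer statistics for the last fetch */
    const git_transfer_progress *stats = git_remote_stats(remote);
    printf("received %u of %u objects\n",
           stats->received_objects, stats->total_objects);

    git_remote_free(remote);
    git_repository_free(repo);
    git_libgit2_shutdown();
    return 0;
}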

audrism commented 6 years ago

Great, that's how fetch works with a populated git database. But we have an empty git database, with the objects stored elsewhere. Is it possible to modify fetch so that it looks up the objects that do not need to be fetched in an external database?

KayGau commented 6 years ago

I'm sorry, Professor, I don't quite get the point. To get new commit objects, we have to fetch from the remote repository. Then what does "look up the objects that do not need to be fetched in an external database" mean? Could you please describe it in more detail? I'm sorry...

audrism commented 6 years ago

Let's consider the example you gave above:

1) "First, I create an empty directory" (an empty git repo, I presume?),

2) "then use the git_remote_fetch to fetch from remote, and I received 21 objects."

3) Now let's take all these 21 objects and store their hashes in some database, e.g., a flat file. You can use gitListSimp.sh to do that (see the sketch after this list).

4) Create another empty git repo

5) Run fetch on it, but use the objects in the external database, so that this fetch only receives the 3 new objects, not all 24 objects?
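
For reference, a rough C equivalent of what gitListSimp.sh would do for step 3, using the public git_odb_foreach API (a sketch only; the repo path and output file name are examples, error handling omitted):

#include <git2.h>
#include <stdio.h>

/* write one sha1 per line into the flat file */
static int dump_oid(const git_oid *oid, void *payload)
{
    char hex[GIT_OID_HEXSZ + 1];
    git_oid_tostr(hex, sizeof(hex), oid);
    fprintf((FILE *)payload, "%s\n", hex);
    return 0;
}

int main(void)
{
    git_repository *repo;
    git_odb *odb;
    FILE *out = fopen("objects.txt", "w");

    git_libgit2_init();
    git_repository_open(&repo, "/home/kgao/Desktop/test/.git");
    git_repository_odb(&odb, repo);
    git_odb_foreach(odb, dump_oid, out);   /* visits every object in the odb */

    fclose(out);
    git_odb_free(odb);
    git_repository_free(repo);
    git_libgit2_shutdown();
    return 0;
}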

KayGau commented 6 years ago

Hi, Professor!

  1. In the first step, it does indeed create a directory (an empty git repo) to hold the .git directory. I didn't delete the directory when the program finished.
  2. In the third step, I tried gitListSimp.sh. In the shell I typed the following command: bash gitListSimp.sh /home/kgao/Desktop/test/.git and it printed all the commits. After running the script, the directory created in the first step still exists and the .git is still there. So there is no need to create another empty git repo in step 4; just use the directory created in step 1, run fetch on it, and use the objects in its .git.
audrism commented 6 years ago

The use case is that I have a database of over 11 billion git objects and would like to update it. In the considered example, all of the 21 objects obtained in the first step are imported into that database, but the repo /home/kgao/Desktop/test/.git is no longer there. I'd like to retrieve only the three new objects, not all 24 objects.

KayGau commented 6 years ago

Oh, now I see the key point. You have collected the commits but have no local repository, so when I fetch there is no local repo to compare against; I need to compare against the commits in the database. I thought you had a local repo... Then I'll need to modify the fetch logic. I'll try!

audrism commented 6 years ago

Thank you. You can assume that any of the existing object hashes and content can be obtained from the database.

KayGau commented 6 years ago

I traced the execution of fetch and found that the git_smart__negotiate_fetch function in smart_protocol.c tells the remote what we want and what we have. I modified it to tell the remote that we have the commits we got last time, but later, when execution reached the git_smart__download_pack function in the same file, the fetch operation raised an error! I checked the output and found the remote indeed knew that we have local commits. Then I checked the function and found that when downloading from the remote, it first writes the odb in the repo to a writepack. The odb is the object database in the repo, which means that to correctly download the new commits we must have a local repo...
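
For context, the negotiation that git_smart__negotiate_fetch performs is just pkt-lines on the wire. A rough illustration (not libgit2 code; the payloads are placeholders, real lines carry 40-hex shas, and capability strings are omitted):

#include <stdio.h>
#include <string.h>

/* a pkt-line is a 4-hex-digit length (which includes the 4 length bytes)
 * followed by the payload */
static void pkt_line(const char *payload)
{
    printf("%04x%s", (unsigned)strlen(payload) + 4, payload);
}

int main(void)
{
    pkt_line("want <tip-sha-advertised-by-the-server>\n");
    printf("0000");                                   /* flush-pkt ends the want list */
    pkt_line("have <commit-sha-we-already-hold>\n");  /* what we claim to have locally */
    pkt_line("done\n");
    return 0;
}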

audrism commented 6 years ago

Yes, we need to have a local repo. The question is whether we just need to prepopulate it with the objects we have, or whether we can replace it with an alternative database backend, e.g., https://github.com/libgit2/libgit2-backends

KayGau commented 6 years ago

OK. By the way, what type of database is used? MySQL, Redis, or SQLite?

audrism commented 6 years ago

It's a custom database based on Tokyo Cabinet, since none of the above work at that size. You can assume an arbitrary interface to it. The difference from git's native object store is that you need to pass the repo url (to get the heads associated with a specific repo).

KayGau commented 6 years ago

OK. Thank you, Professor!

audrism commented 6 years ago

Also, the type of object needs to be passed as well as the repo: tag/tree/blob/commit.

KayGau commented 6 years ago

Hi, Professor! Over the past few days I wrote 4 source files to populate blob, tree, commit and tag objects into the odb individually. Then I fetched from the remote. The result is that it still fetches all objects, including those we populated into the odb... It seems that populating the odb alone doesn't work.
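
What the populating code roughly does (a simplified sketch, not my exact source; the raw content would come from the external database, and error handling is omitted):

#include <git2.h>
#include <string.h>

/* write one raw object (here a blob) straight into the repository's odb;
 * the same call with GIT_OBJ_TREE / GIT_OBJ_COMMIT / GIT_OBJ_TAG
 * (GIT_OBJECT_* in newer libgit2) stores the other object types,
 * as long as the raw content is byte-exact */
int populate_blob(git_repository *repo, const char *content)
{
    git_odb *odb;
    git_oid oid;

    git_repository_odb(&odb, repo);
    int error = git_odb_write(&oid, odb, content, strlen(content), GIT_OBJ_BLOB);
    git_odb_free(odb);
    return error;
}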

audrism commented 6 years ago

Have you checked them in? So what is the difference between a fetch done on a naturally created repo and one done on a prepopulated repo? Have you set the heads properly? For example:

cat gather/.git/refs/heads/master
e01c6718c3b9264926886d6190b05bfa1d069167

When prepopulating the object store, the heads would not necessarily be set. It would be best to compare the exact differences between the two repos; that will give hints as to why fetch behaves differently.

KayGau commented 6 years ago

I figured out the cause: when creating the first commit, libgit2 won't create a branch, so I need to create the master branch manually. By the way, when populating a binary blob (such as an executable file) into the odb, libgit2 only offers the API git_blob_createfromdisk() to read a file from the filesystem and write its content into the odb. So I am wondering whether the binary blob's content is stored in the database as a file?
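
Creating the branch manually is essentially one call; roughly (a sketch, error handling omitted):

#include <git2.h>

/* point refs/heads/master at the pre-populated commit */
int set_master(git_repository *repo, const char *commit_sha)
{
    git_oid oid;
    git_reference *ref = NULL;

    git_oid_fromstr(&oid, commit_sha);
    int error = git_reference_create(&ref, repo, "refs/heads/master", &oid,
                                     1 /* force */, "populate head");
    git_reference_free(ref);
    return error;
}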

audrism commented 6 years ago

Great! It checks the .git/refs/heads/* for the latest commits on each branch and compares them to what is on the server. That way only diffs can be sent back.

Can you pre-populate the repo with only the commits, or, preferably, only the last commit? Re-creating a large repo from the database may be time-consuming; perhaps you can reverse-engineer what happens when fetch is done on a non-empty master?

KayGau commented 6 years ago

Indeed, populating all objects is time-consuming. I'll have a try.

KayGau commented 5 years ago

Hi, Professor! These days I tried populating only commit objects. I found that the function git_commit_create_with_signature() could help, so I modified its logic to populate only the commit object. It works: it can populate all the commit objects, and it can also populate only the last commit.

However, an error happened, similar to the one from when I earlier tried to modify the fetch logic to look up objects that do not need to be fetched in an external database. When I used a repo containing only the last commit, or all the commits, and fetched (git_remote_fetch) from the remote, a fetch error occurred. I checked .git/objects and found it has a damaged pack file. I also repeated this with the git fetch command and printed the fetch process, which showed that the remote indeed knew that we have the last commit, but when downloading from the remote it omitted some objects. I found these omitted objects are related to the objects we are missing locally, such as the blob and tree objects.
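
For reference, the core of the commit-populating call is just this (a sketch of the basic, unmodified API usage; raw_commit is the exact commit content taken from the database, and error handling is omitted):

#include <git2.h>

/* raw_commit is the full commit body ("tree ...\nparent ...\nauthor ...\n...\n\nmessage"),
 * byte-for-byte as stored for that sha; passing NULL for the signature is
 * supposed to write an unsigned commit object into the odb */
int populate_commit(git_repository *repo, const char *raw_commit, git_oid *out)
{
    return git_commit_create_with_signature(out, repo, raw_commit, NULL, NULL);
}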

audrism commented 5 years ago

Great. What happens is that first the additional commits are determined on the server, but in addition to commits, the new trees (and subtrees) associated with the new commits need to be sent. In order to determine what new trees and associated blobs need to be sent, a diff is done to calculate the delta. It seems that part of the diff may be done on the client side or, more likely, the loose objects are combined into compressed storage (a pack file) and the prepopulated commit does not have its tree, hence the crash.

Can you track the attempts to access local objects during fetch, e.g., using trace or adding prints to the relevant functions? Once such calls are identified, we can replace them with calls to an external database.

KayGau commented 5 years ago

Oh, yes. I'll try to find the corresponding logic. Thank you for your hint!

KayGau commented 5 years ago

Hi, Professor! When populating only the latest commit and then fetching, an error occurred because former objects were missing. I modified the logic to resolve it and succeeded. Then another error occurred: missing delta bases. I printed out the missing object and found it is the latest commit's tree. I tried populating the latest commit's tree into the odb, but it didn't work, so I tried to figure out the reason. The fetch protocol downloads data, probably a packfile, which is binary data organized against former objects called "delta bases"; those delta bases are then used to decompress the packfile. So the "missing delta bases" error occurs when the odb lacks those former objects; only when all of those objects are in the odb will the error not occur. Thus, we must access the external database. So I want to ask whether the external database is complete, for example including binary files, pictures and so on. If not, the error will happen again.

audrism commented 5 years ago

The external database contains the shas of all objects. It stores the content of all tags, trees, and commits, but only of text blobs.

Can you stop fetch once it downloads the pack file? All needed objects should be in the pack file, and the binary blobs don't need to be extracted. The pack file format is not too complicated; there is no need to use git to unpack it.

This talks a bit about the pack file format: https://git-scm.com/book/en/v2/Git-Internals-Packfiles

In other words, can you write a function that a) allows fetch to download the pack file and b) stops there?

I think that the packfile may contain all what is needed.
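
For a quick sanity check: the first 12 bytes of a pack are the signature "PACK", a 4-byte big-endian version (currently 2), and a 4-byte big-endian object count. Something like this sketch can verify a captured file (error handling omitted):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohl */

int main(int argc, char **argv)
{
    unsigned char hdr[12];
    FILE *f = fopen(argc > 1 ? argv[1] : "packfile", "rb");

    fread(hdr, 1, sizeof(hdr), f);
    if (memcmp(hdr, "PACK", 4) != 0) {
        fprintf(stderr, "not a pack file\n");
        return 1;
    }
    uint32_t version, nobjects;
    memcpy(&version, hdr + 4, 4);
    memcpy(&nobjects, hdr + 8, 4);
    printf("version=%u objects=%u\n", ntohl(version), ntohl(nobjects));
    fclose(f);
    return 0;
}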

KayGau commented 5 years ago

I get the point. I need to capture what the fetch operation receives from the remote. I think I can write the data to a file, but I am not sure the received data matches the packfile format, so that needs to be confirmed. This describes the format in detail: https://mirrors.edge.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt I will validate the format of the received data, and if it matches, I will try to decompress it. This talks about that: https://codewords.recurse.com/issues/three/unpacking-git-packfiles

audrism commented 5 years ago

Try to save it first (it might actually save it on its own). I can help you with decoding if it is complicated.

KayGau commented 5 years ago

I checked the library code but didn't find any code that saves it. When fetching, libgit2 receives data from the remote; the data is stored in a struct _git_pktdata. Libgit2 receives repeatedly until all the data has been downloaded, and then decompresses it. The "missing delta bases" error occurred while decompressing. I then tried writing the data from the struct _git_pktdata to a file and succeeded. Is this OK?

audrism commented 5 years ago

If you have an example of a trace for a simple repo, and the code, can you share them?

typedef struct { enum git_pkt_type type; int len; char data[GIT_FLEX_ARRAY]; } git_pkt_data;

KayGau commented 5 years ago

The fetch protocol handles three packet types here: GIT_PKT_PROGRESS, GIT_PKT_DATA, and GIT_PKT_FLUSH. I only care about GIT_PKT_DATA packets, so when a GIT_PKT_DATA packet is received, I write its data into a file; in the end, the data of all GIT_PKT_DATA packets is written to that file. I used my own repo to test: https://github.com/KayGau/test.git I populated a commit object whose sha1 is bd5cb845e25ec967486175475caad097287c7033. The received data is written to the following file: https://github.com/KayGau/libgit2/blob/master/packfile

audrism commented 5 years ago

What is the format of GIT_PKT_DATA? I.e., what struct/union describes it? I'm on my phone right now, so it's hard to search the code...

KayGau commented 5 years ago

GIT_PKT_DATA is one value of git_pkt_type, an enum type:

enum git_pkt_type { GIT_PKT_CMD, GIT_PKT_FLUSH, GIT_PKT_REF, GIT_PKT_HAVE, GIT_PKT_ACK, GIT_PKT_NAK, GIT_PKT_PACK, GIT_PKT_COMMENT, GIT_PKT_ERR, GIT_PKT_DATA, GIT_PKT_PROGRESS, GIT_PKT_OK, GIT_PKT_NG, GIT_PKT_UNPACK, };

git_pkt_type is a field of git_pkt_data. Libgit2 first receives data from the remote in a struct git_pkt:

typedef struct { enum git_pkt_type type; } git_pkt;

Then it checks the pkt's type field; if the type is GIT_PKT_DATA, libgit2 casts it to git_pkt_data:

git_pkt_data *p = (git_pkt_data *) pkt;
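
The capture I added is essentially this (a sketch, not the exact patch; it would sit inside git_smart__download_pack in smart_protocol.c, where the internal git_pkt / git_pkt_data types quoted above are visible):

#include <stdio.h>

static FILE *capture;   /* opened once, e.g. fopen("packfile", "wb") */

static void capture_pkt(git_pkt *pkt)
{
    if (pkt->type == GIT_PKT_DATA) {
        /* only GIT_PKT_DATA packets carry pack bytes; progress and
         * flush packets are skipped */
        git_pkt_data *p = (git_pkt_data *)pkt;
        fwrite(p->data, 1, p->len, capture);
    }
}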

audrism commented 5 years ago

Maybe you can also write the .idx file? Also, where is the code that writes the packfile?

KayGau commented 5 years ago

I checked the code, but when libgit2 receives the data it does not treat the .pack file and the .idx file separately; that is, the received data (in the git_pkt_data struct) is a mixture of .pack file and .idx file. Libgit2 stores all the received data in another struct and decompresses it, but because a base object is missing, an error occurred. The relevant code is in src/transports/smart_protocol.c, in the git_smart__download_pack function.

audrism commented 5 years ago

I made a slight mod: can you rerun and check in both packfile and packfilei?

I checked the current packfile, but it does not appear to be in the format described here: https://codewords.recurse.com/issues/three/unpacking-git-packfiles

For example: the current packfile starts with PACK version=2 type=1 len=1067, but the chunk after it does not decompress.

KayGau commented 5 years ago

For this case, the packfile is the same as before. The packfilei's content is: 8191 530 1 Besides, in order to test whether our method (that is, populating only the latest commit object and fetching) downloads the data correctly, I used the earlier method that succeeded but is time-consuming (populating all the objects, including blobs, trees and commits); it downloads the same data.

audrism commented 5 years ago

OK, so the repo is https://github.com/KayGau/test.git

Pre-populate an empty repo with bd5cb845e25ec967486175475caad097287c7033

Do a fetch,

Right?

KayGau commented 5 years ago

Yes. It downloads correctly, the same as when populating all blob, tree, and commit objects. The packfile and packfilei are the same in both cases. In the latter case, it correctly downloads the data and decompresses it into a .pack file and an .idx file. I used the following command to unpack the .pack file; the output is in the attached file: unpack.txt

audrism commented 5 years ago

What I am saying is that packfile is not correctly formatted:

git verify-pack -v packfile
packfile.pack: bad

audrism commented 5 years ago

I have added a dump of objects from https://github.com/KayGau/test.git

The format of the idx file is: offset;length;length when uncompressed;sha1;project
For example: 8;15;35;23ebb771c634697086b54b29b1572a51be307364;test.git;raw

The format of the bin file is simple concatenation. To get the object above: fseek(8), then fread 15 bytes; that's the exact content of the object. If you prepend blob 35\0 to the content and compute the sha1, it will be 23ebb771c634697086b54b29b1572a51be307364.
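
To illustrate, reading that example object back would look roughly like this (a sketch; the dump file name is made up, and I'm assuming from the two differing length fields that the stored bytes are zlib-deflated and need inflating first; needs zlib and OpenSSL, error handling omitted):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>
#include <openssl/sha.h>

int main(void)
{
    /* values from the example idx line: 8;15;35;23ebb771...;test.git */
    long offset = 8, stored_len = 15, raw_len = 35;

    FILE *f = fopen("test.bin", "rb");          /* hypothetical dump file name */
    unsigned char *stored = malloc(stored_len);
    fseek(f, offset, SEEK_SET);
    fread(stored, 1, stored_len, f);
    fclose(f);

    /* inflate the 15 stored bytes back to the 35 raw bytes */
    unsigned char raw[64];
    uLongf out_len = sizeof(raw);
    uncompress(raw, &out_len, stored, stored_len);

    /* sha1("blob 35\0" + content) should reproduce the object id */
    char header[32];
    int hlen = snprintf(header, sizeof(header), "blob %ld", raw_len) + 1; /* keep the NUL */

    SHA_CTX ctx;
    unsigned char sha[SHA_DIGEST_LENGTH];
    SHA1_Init(&ctx);
    SHA1_Update(&ctx, header, hlen);
    SHA1_Update(&ctx, raw, out_len);
    SHA1_Final(sha, &ctx);

    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        printf("%02x", sha[i]);
    printf("\n");   /* expect 23ebb771c634697086b54b29b1572a51be307364 */
    free(stored);
    return 0;
}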

KayGau commented 5 years ago
  1. The git verify-pack -v command reads the .idx file, so this command fails here.

  2. I read a tutorial on the internet: https://shafiul.github.io//gitbook/7_the_packfile.html Its example shows that the object size is stored little-endian, so the first object's size is 696, not 1067, which is exactly the first commit's size. By the way, this object size is the expanded data's size, not the compressed data's size. I think we can try again.

  3. I will check the code again. I think the library may process the received data, for example this code in smart_protocol.c: error = writepack->append(writepack, p->data, p->len, stats); error = writepack->commit(writepack, stats); These two functions are in the src/odb_pack.c file, corresponding to the pack_backend__writepack_append and pack_backend__writepack_commit functions. Those two functions in turn call two functions in the src/index.c file: git_index_append and git_indexer_commit. They are quite complicated, but I will try to understand them.

audrism commented 5 years ago

Yes, it works. Can you create a program that does this kind of fetch (with an argument providing the packfile name)?

audrism commented 5 years ago

Ok, here is what's needed:

parameters: git url and name of the output file (packfile)
standard input: a list of commit shas (from refs/heads)
output: packfile
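
Roughly this shape (a skeleton only; the temp path and ref names are made up, it assumes the capture/modified download path discussed above actually writes the pack to argv[2], and the commits named on stdin may need their content populated first, since as far as I recall libgit2 refuses to create a ref whose target object is not in the odb; error handling omitted):

#include <git2.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <git-url> <packfile>\n", argv[0]);
        return 1;
    }

    git_repository *repo;
    git_remote *remote;
    char line[64];
    int n = 0;

    git_libgit2_init();
    git_repository_init(&repo, "/tmp/fetch-tmp", 0);

    /* one ref per commit sha read from stdin, so the remote learns our heads */
    while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';
        git_oid oid;
        if (git_oid_fromstr(&oid, line) != 0)
            continue;
        char refname[64];
        snprintf(refname, sizeof(refname), "refs/heads/head%d", n++);
        git_reference *ref;
        git_reference_create(&ref, repo, refname, &oid, 1, "prepopulate");
        git_reference_free(ref);
    }

    git_remote_create(&remote, repo, "origin", argv[1]);
    git_remote_fetch(remote, NULL, NULL, NULL);
    /* ...the modified download path would write the raw pack to argv[2]... */

    git_remote_free(remote);
    git_repository_free(repo);
    git_libgit2_shutdown();
    return 0;
}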

KayGau commented 5 years ago

OK! I'll try~

KayGau commented 5 years ago
  1. Before, I just stored the packfile in a file named 'packfile'. But I think that when fetching in parallel this may cause trouble. So I tried to add a parameter (the packfile name) to the relevant library code to store the packfile under the given name, but these functions call and are called by many other functions, so adding a parameter got messy. I then switched to another method: every running program has a unique process ID, so I store the packfile content in a file named after the pid. Then, in the program I wrote, I read from that file and copy its content to the file with the given packfile name. Is this OK?
  2. I don't understand what the standard input is for. A list of commit shas? Are they for comparing with the local repo to determine whether we need to fetch? If so, I may need to access the database to compare...
  3. As described before, if we need to fetch, we need to pre-populate the latest commit into a local .git directory, so we also need the latest commit's content.
audrism commented 5 years ago
  1. Yes, all the head refs. They need to be sent to the remote so that fetch gets only the diff.

  2. Assume that the content of these commits can be read from a file.

KayGau commented 5 years ago
  1. The library will first store the content in a file named after the pid; I think that is unique when running in parallel. Then I will read from this file and write its content to a file named by the parameter. But I am not sure it will work well; I'll try the global variable option.
  2. I need to think over how to use these refs.
audrism commented 5 years ago
  1. Do you really need to prepopulate the repo with the commit content, or do you just need to put the shas in .git/refs/heads/?