Open audrism opened 6 years ago
I tried to create an empty repo and put the shas in .git/refs/heads, but two problems occurred:
Object not found - no match for id (a sha)
So I think we need to have not only the shas, but also the project in the repo.
By the way, I am confused about b). Can you explain it in detail?
Ok, so assume that the commits are available: can these be put in a packfile (hand-made)?
To be clear, I understand that you now create commits from the text components, but the whole commit already is in the external database, so it would be more efficient to just place it into the repository e.g. for your test.git repo put f395694fe958521c8946db8a8be016fd2e84f8cd in .git/refs/heads/master and the content of the object into .git/objects/f3/95694fe958521c8946db8a8be016fd2e84f8cd
Oh, yes! I get the point. I misunderstood the 'packfile (hand-made)' before. I'll try this!
A git object is not stored directly in the .git/objects/xx/xxxxxx file. A header is first prepended to the content, and the result is then compressed with zlib, as described in https://git-scm.com/book/en/v2/Git-Internals-Git-Objects. I wrote a Ruby script and succeeded in creating the .git/objects/xx/xxx file. Is this OK?
Don't worry about producing these objects, just assume that they are provided from the database in exactly the right format you need to store them in .git/objects/xx/xxx file.
For example, as in the .idx/.bin files, you just fread it from the c code and write it to .git/objects/xx/xxx before opening git repo from libgit2
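For reference, here is a minimal Python sketch of the loose-object layout discussed above: the header "type length NUL" is prepended to the content, the sha is the SHA-1 of header plus content, and the zlib-compressed result is stored under objects/xx/xxx. The function names are made up for illustration; this is not the Ruby script or the batch_fetch code.

```python
import hashlib
import os
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    # Git hashes "<type> <length>" + NUL + content with SHA-1.
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


def write_loose_object(git_dir: str, obj_type: str, content: bytes) -> str:
    # git_dir is the path of the .git directory (hypothetical parameter name).
    header = f"{obj_type} {len(content)}\0".encode()
    sha = hashlib.sha1(header + content).hexdigest()
    # Loose objects live in objects/<first 2 hex chars>/<remaining 38 chars>.
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        # The stored file is the zlib-compressed header + content.
        fh.write(zlib.compress(header + content))
    return sha
```

If the database already provides the bytes in exactly this compressed form, only the path computation and the write are needed.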
In any case, here is what's needed:
1. parameter: git url, and name of outputfile (packfile)
standard input: a packfile of commits (from ref/heads)
output: packfile
OK. By the way, can you give me an example, including the git url, the name of the output file (packfile), and the packfile of commits (from refs/heads)? That would make it easy for me to test. Thank you!
You can create a branch in your test repo, that will provide you with two heads.
I do not have a packfile creator, but I have a packfile reader in libgit2/readPack.perl (it still needs work).
As far as creating git objects, take a look at bitbucket.org/swsc/lookup/checkBin.perl
my $code = safeDecomp ($codeC, $msg);
my $len = length ($code);
my $hsha1 = sha1_hex ("$type $len\0$code");
where $codeC is what is stored in the database, then its git sha should match sha1_hex ("$type $len\0$code");
I'm sorry, but I cannot open this link: a 404 error occurred. What is stored in the database is the compressed object's content.
I think we do not necessarily need a packfile of commits. We can store each object's content (that is, $code in your listing) in a file named by its sha. I then read the file, compress it, and put it into .git/objects/xx/xxxx (I have written a Ruby file to do this). For example, suppose there are 2 heads in the repository "test". Then you put the pretty-printed content of these two commits (git cat-file -p, just the $code) in two separate files named by their shas.
Then it looks like this:
parameters: git url, name of the output file, name of the directory where the commits' content is stored
standard input: shas (from refs/heads) on one line, separated by ';'
output: packfile
Since I'd like to batch-process thousands of repos at a time and do not want to create all these thousands of folders/files, can the following be used instead? parameters: git url, name of the output file, and the file where the commits' content is stored; standard input: commit sha (in hex format);offset;length. You can assume either that the commit content is compressed or that it is not. Once the program reads offset/length, it does fseek/fread.
What is your bitbucket id?
But I think that if we don't create folders, putting object content into .git/objects/xx/xxx and shas into .git/refs/heads for different repos will cause trouble: multiple repositories' refs/heads will be in the same directory, and on fetch the program will send all of them out. So I think creating an individual folder for each repository is necessary. My bitbucket id is KayGao
The fetch command will still have to create a repo each time it is invoked and read from the file containing commit content, create .git/objects needed, and run fetch
So the task is split into three parts? That is:
I was hoping to do all three in C. If you need help with the second part, I can help. Just leave the parts you have problems with for me. The program should be called batch_fetch.c
OK.
I am a little confused about the git_get_last() function. It invokes the filter_wants_1() function in src/fetch.c. I made a small modification to it to get the names under refs/heads, such as master, but I found that it also gets refs/tags. (Attachment: fetch.zip)
The attachment is what I have modified. You can run echo https://github.com/ssc-oscar/libgit2 | get_last and see that some of the ref names are tags.
I ran either:
echo "082d5e91a657b93e15395d76074316b3b70e57e7;master;0;796" | ./build/batch_fetch https://github.com/pidanself/Testbranch Testbranch txt packfile
or
(echo "082d5e91a657b93e15395d76074316b3b70e57e7;master;0;796"; echo "be5a7ee90f39b3fc9b7fba0042706a6a33051603;branch2;796;255") | ./build/batch_fetch https://github.com/pidanself/Testbranch Testbranch txt packfile
Both produced the same packfile with three objects: is that correct?
The repo has two branches, named master and branch2, and there is one new commit after commit 082d5e, which means one more commit, tree, and blob object. At commit 082d5e, he merged all of branch2 into master; that is, he first committed three times on branch2 and then merged it into the master branch. For the second case it is correct to have three objects, but for the first I am not sure whether it should have three objects...
Clarification from unpacking the resulting packfile: it has three objects in 339 bytes (sha 3df5e679680a99d18cd09464ed7aa14de86ff3f9). Types are given in binary.
First is a commit (type 1), second is a tree (type 10).
Third is OBJ_REF_DELTA (type 111). It refers to object 2545da8798cf8f8883a3bce475e11e09dd3ca945.
Uncompressed, we get 13 bytes.
First byte: instructions are 01011110, which means append after 94 bytes of the original object.
What is being appended is only the last 8 bytes (\ngenggai), and there are four bytes after the instructions that do not appear to be documented: 102 144 94 8
(I am basing this on https://mirrors.edge.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt)
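As a cross-check on the type numbers above, here is a small Python sketch of the per-object header as I read it in pack-format.txt: the first byte carries a continuation bit, a 3-bit type, and the low 4 bits of the inflated size; each following byte adds 7 more size bits. The function name is made up, and this is not the readPack.perl code.

```python
# Type numbers as listed in pack-format.txt.
TYPE_NAMES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
              6: "ofs_delta", 7: "ref_delta"}


def parse_entry_header(buf: bytes, pos: int = 0):
    """Return (type, inflated_size, next_pos) for the entry header at buf[pos]."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07    # bits 6..4 hold the object type
    size = byte & 0x0F               # bits 3..0 hold the low 4 size bits
    shift = 4
    while byte & 0x80:               # top bit set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos
```

For a type 7 entry, the 20-byte base object name follows at next_pos, and the zlib stream starts after it.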
Up to "It refers to object 2545da8798cf8f8883a3bce475e11e09dd3ca945", everything is correct. The appended bytes are also correct. But I am unsure about the first byte, 01011110. According to the blog, it means append 94 bytes to the target object. Can you give me the unpacking program?
I checked it in (including generated packfile): perl readPack.perl packfile
Btw, https://codewords.recurse.com/issues/three/unpacking-git-packfiles seems to describe a different format, though it appears to be less precise
OK, I'll try to locate where the problem is. Also, I'll check the blog.
The OBJ_REF_DELTA type has a 20-byte base object name after the header. readPack.perl doesn't seem to take this into account; it just inflates. It should first skip these bytes and then inflate.
No, don't you see this code:
my $hh = substr($str1, 0, 20);
$sha = unpack "H*", $hh;
print "base sha=$sha\n";
$str1 = substr($str1, 20, length($str1)-20);
Also here is the quote from previous comment:
Uncompressed, we get 13 bytes.
First byte: instructions are 01011110, which means append after 94 bytes of the
original object.
What is being appended is only the last 8 bytes (\ngenggai), and there are
four bytes after the instructions that do not appear to be documented: 102 144 94 8
Also https://codewords.recurse.com/issues/three/unpacking-git-packfiles seems to describe a different packfile format
I didn't find this code... maybe you changed it locally
Should be there now.
In any case, the documentation is inconsistent and clearly wrong. One option is to guess the actual format for delta. Another approach is to find the actual libgit2 code that writes OBJ_REF_DELTA and OBJ_OFS_DELTA (probably a faster approach if the code is not too convoluted). Neither format is described accurately in the references.
For OBJ_REF_DELTA the actual format is as follows (total 13 bytes):
byte 0: size of original object=94 (01011110, earlier read as an instruction)
byte 1: size of new object=102
byte 2: not sure=144
byte 3: offset?=94
byte 4: number of bytes to write from delta?=8
bytes 5-12=\ngenggai
Sounds like a good idea! I will find its correct format. I have the following guess:
byte 0: size of the original object, that is 94 bytes
byte 1: size of the new object, that is 102 bytes
bytes 2-12: instructions and data
byte 2: 144 = 10010000; according to https://mirrors.edge.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt this is a copy instruction with offset 0 and size1 present, so the following byte, byte 3, is the number of bytes to copy. Read as little-endian, we should copy 94 bytes from the base object.
byte 4: 8 = 00001000; this is an insert instruction. We should insert 8 bytes, so the following 8 bytes are the data to append to the target object.
Of course, this is a guess; I will find its correct format!
Yes, that appears to work for type 7, it is still not clear how type 6 is encoded
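The reading above (two size varints, then copy and insert instructions) can be sketched in Python as follows. This is a sketch under my understanding of pack-format.txt, with made-up function names; it is not taken from libgit2 or readPack.perl.

```python
def read_varint(buf: bytes, pos: int):
    """Little-endian base-128 number: 7 bits per byte, top bit = more bytes."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    base_size, pos = read_varint(delta, 0)
    result_size, pos = read_varint(delta, pos)
    if base_size != len(base):
        raise ValueError("delta was made against a different base object")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            # Copy: bits 0-3 say which offset bytes follow, bits 4-6 which size bytes.
            offset = size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000  # a size of zero stands for 64 KiB
            out += base[offset:offset + size]
        elif cmd:
            # Insert: the next cmd bytes are literal data.
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("instruction byte 0 is reserved")
    if len(out) != result_size:
        raise ValueError("result size mismatch")
    return bytes(out)
```

With the 13 bytes from this thread (94 102 144 94 8 followed by \ngenggai) and any 94-byte base, the result is the base followed by \ngenggai, 102 bytes in total.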
I ran either:
echo "082d5e91a657b93e15395d76074316b3b70e57e7;master;0;796" | ./build/batch_fetch https://github.com/pidanself/Testbranch Testbranch txt packfile
or
(echo "082d5e91a657b93e15395d76074316b3b70e57e7;master;0;796"; echo "be5a7ee90f39b3fc9b7fba0042706a6a33051603;branch2;796;255") | ./build/batch_fetch https://github.com/pidanself/Testbranch Testbranch txt packfile
Both produced the same packfile with three objects: is that correct?
I think it is correct. Git will not fetch the commits of a branch that is already merged. I then made a small modification in branch2 and ran both; they produced 6 objects, which is expected. So Git ignores merged commits.
So for type 6, there are 20 bytes after the header that give the base object's hash,
You mixed up type 6 and 7.
The offset is the position of the base object's header in the pack file.
It's the negative offset as described in the documentation.
For example, for 10110011 10010001 01001001 the offset is ((((51+1)<<7)+17+1)<<7)+73. Note that 51 is not 10110011 but 00110011 (the first bit is dropped).
I found a bug in readPack.perl (it was not advancing the offset for type 6) now it seems to work (though still does not reconstruct delta objects just yet).
To make the update work, there are still a couple of pieces remaining:
Glad to hear that it works! Btw, what can I help with for 1 & 2?
It works for type 7.
For example: zcat /data/update/heads201813/sf201813.prj..heads | grep wsclean | perl -ane 's/^[^;];//;s/;/\n/g;print' | sort -u | ~/lookup/showCmt.perl 128 0 tst/output produces the data (though the names of the branches need to be modified)
But head -1 tst/output.idx | ./build/batch_fetch https://git.code.sf.net/p/wsclean/code wsclean tst/output.bin tst/packfile appears to get too many objects (2560)
pack file has 2560 objects!
I am sorry, but I don't understand what you mean. Have you modified the branches' names? The branch names must be correct; otherwise they are useless: the program will treat these branches as nonexistent and fetch all of the branch's objects as new. What should the correct number of objects be? Does your data come from 2018? I cannot open this link...
Btw, I found a bug in readPack.perl. In lines 135-138, when adding up the offset, we shouldn't add 1 for the last byte. In the example 10110011 10010001 01001001: the first bit of the first byte is 1, so we get 0110011, i.e. 51; the first bit of the next byte is 1, and we get 0010001, i.e. 17; the first bit of the last byte is 0, and we get 1001001, i.e. 73. Then the offset is ((((51+1) << 7) + 17 + 1) << 7) + 73.
In the pack file you uploaded (tst/packfile), for the first type=110 object the a is 0 74, so the offset is (0+1)*128+74=202, and 34102-202=33900, which is exactly the last object. Does it work for type 6? If not, what errors happen?
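The rule described here (add 1 before each shift, but not after the last byte) can be written as a short Python sketch. The function name is made up, and the byte layout follows my reading of the OBJ_OFS_DELTA description in pack-format.txt.

```python
def read_ofs_delta_offset(buf: bytes, pos: int = 0):
    """Return (negative_offset, next_pos) for a type 6 entry.

    Each byte contributes its low 7 bits; the top bit says another byte follows.
    """
    byte = buf[pos]
    pos += 1
    value = byte & 0x7F
    while byte & 0x80:
        byte = buf[pos]
        pos += 1
        # The +1 is applied to the bits read so far, never after the last byte.
        value = ((value + 1) << 7) | (byte & 0x7F)
    return value, pos
```

The base object's header is then at (position of this entry's header) minus the returned value, e.g. 34102 - 202 = 33900 for the case above.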
Another example:
echo "48e450556eeaa0f323f164aa26400cc44ba0b94b;master" | ~/lookup/showCmt.perl 128 0 tst/output
cat tst/output.idx | ./build/batch_fetch https://github.com/ssc-oscar/gather lg2 tst/output.bin tst/packfile
Does produce packfile: it should have 8 new commits (and associated objects)
It does have 8 commits, I presume right ones.
The base objects for type 7 are already in the database.
So the only issue remaining is how to extract objects from that packfile.
This should preferably be done in C, but the C code needs access to the WoC object database.
I tried a more realistic case:
head -1 tst/linux-stable.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/linux-stable.bin tst/linux-stable.pack
It produces 2831895 objects, while only 76652 new objects are needed.
head -1 tst/linux-stable.idx
0f0910a100951204a48052ce62ca72915511ecc6;master;0;1632
has only the master branch, so maybe it is necessary to provide more commits for other branches to avoid this massive fetch?
If I provide more branches it seems to get into trouble:
cat tst/linux-stable.idx | grep -v 8433e5c9c8304b750c519| ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/linux-stable.bin tst/linux-stable.pack5
fetch error: Object not found - no match for id (0199619b21f7320482e8a2db14cf8bc974a7766a)
Yes, we need to have all the branches. I have seen the data, I will try to fix the bug.
It is committed
I found the bug. It is caused by the input, which should obey the following format: hash;head's name;offset;length
In tst/linux-stable.idx, the last field is not the length but the end offset. So it should look like this:
0f0910a100951204a48052ce62ca72915511ecc6;master;0;1632
8433e5c9c8304b750c519ce3e0940dab675f6573;linux-3.18.y;1632;302
8433e5c9c8304b750c519ce3e0940dab675f6573;linux-3.18.y-queue;1934;302
0199619b21f7320482e8a2db14cf8bc974a7766a;linux-4.1.y;2236;301
623dfab42becf5c56c9a31b7eaf90cb6eb86459f;linux-4.1.y-queue;2537;1110
Then it can fetch successfully. In this case it fetches 2825657 objects, still a lot, but I think that is because some branches are still not provided.
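To make the expected input concrete, here is a hedged Python sketch of reading one hash;head's name;offset;length record and fetching the commit bytes with seek/read (the analogue of fseek/fread), plus a helper that rewrites records whose last field is an end offset. All function names are hypothetical; batch_fetch itself is C and may differ.

```python
def parse_index_line(line: str):
    """Split 'sha;head name;offset;length' into typed fields."""
    sha, rest = line.strip().split(";", 1)
    name, offset, length = rest.rsplit(";", 2)
    return sha, name, int(offset), int(length)


def read_commit(bin_file, offset: int, length: int) -> bytes:
    """Read exactly `length` bytes at `offset` from an open binary file."""
    bin_file.seek(offset)
    data = bin_file.read(length)
    if len(data) != length:
        raise ValueError("record extends past the end of the .bin file")
    return data


def end_offsets_to_lengths(lines):
    """Rewrite 'sha;name;start;end' records as 'sha;name;start;length'."""
    fixed = []
    for line in lines:
        sha, name, start, end = parse_index_line(line)
        fixed.append(f"{sha};{name};{start};{end - start}")
    return fixed
```

For the first record the two forms coincide (start is 0), which may be why the problem only showed up once more branches were supplied.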
I created a realistic test case of updating one linux kernel repo: tst/ls.idx has commits from an earlier state of the repo (tst/linux-stable.heads.1555007357) and I am trying to update to the current version (tst/linux-stable.heads.1556119740)
Btw, not all commits in tst/linux-stable.heads.1555007357 are in the cloned repo.
cat tst/ls.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/ls.bin tst/ls.pack
tree ada630e1da499723c827ba0ff1084f93daf9ed9c
parent b89e3859db0658df57abfb1396ebad8d1f4580bb
au
thor Steve F ;ch <stf` @microsoft.com> 1552856318 -0500
Result of SHA1 : 8ee9a2d029c9980a3545c2acbeaa8def113f5b88
Segmentation fault
According to the output, I think the cause is the ls.bin file. I downloaded it but cannot open it locally. I converted it to UTF-8, but it shows garbled text (the cat command shows the same). So I am wondering whether it was produced correctly? The former .bin file (linux-stable.bin) opens correctly. I am also a little confused about why "not all commits in tst/linux-stable.heads.1555007357 are in the cloned repo". I checked a case locally: when I download a repo, find .git/objects covers all the commits, but find .git/refs shows only the master branch.
Correct, the batch program wants uncompressed commits. It would make sense to read compressed data, though. Can you change batch to read compressed data? (it's the zlib that is used by git to compress objects/create pack files)?
No, there are several objects missing; for example, these commits are in tst/linux-stable.heads.1555007357 but are not in the currently cloned version of https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable
2c56cc648c953f4c55d215bda8894d2d1af083d0
363a48d6a7728ae167d82c96b8198f30df245e73
3d9c55a7eaf3ab272e3931c0d43a0082d594f745
4015156a8247efed44281818a518c95e37323593
435c43a7aaa2eb50996391fa7ec11945c341d71d
4974ffe3e72f4a065a9b8f01661a378156b94bd8
64c0aa2ee0e25f49da0ff7aaf04595de61f23306
78d31592da78ad793dba5a289d3c93d0edbb58c0
9a7c9255ec3851bb32ced8dbd271acc3ad125bc5
9eeacd838a2e7d3d838b4b4a0808056383d121ae
a017fd9894843a081fe409688ae4d02e907cfbe1
a9ef068a445f4897b8ebe3a1c42c0e7f25d1bb53
aecbffbf4512172fc26f835affbd8c963585f944
b6f2e7667ad28631b07a70e3c45ac089f5db593f
e3a3a99e4c112ec0bf891cdca2a15c068ce7a0de
eb1937f6e059b8e15f3ec51d57549f962e91c01d
edd999994aac4fa9336d3d9a140908f355364d08
f471faf0251d0b8660d0a4c9f8f709183054bbca
Here are the results on the uncompressed version:
cat tst/tst.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/tst.bin tst/ls.pack
.....
filter_wants head=1359 local=0 id=8b27b23bbd78750df2eb8a5c59ad067acbb0d273 name=refs/tags/v4.1.48 /home/audris/swsc/libgit2/src/fetch.c:141
filter_wants head=1360 local=0 id=0199619b21f7320482e8a2db14cf8bc974a7766a name=refs/tags/v4.1.48^{} /home/audris/swsc/libgit2/src/fetch.c:141
git_fetch_negotiate need 85 /home/audris/swsc/libgit2/src/fetch.c:177
fetch error: Object not found - no match for id (51a60126aea86f259169d74fb1de5ca3d6f6481b)
OK, I'll try to find the error.
Here is an update (nothing should be retrieved as of now), as tst/1556204601.* has all the commits and head labels:
cat tst/1556204601.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/1556204601.bin tst/ls.pack
tree aca070ed8a759fe2d241e188964c7a43190466dd
parent cc59bae2c6b2ab2bc277fbcf09944f03c8d5a8ed
author Tadeusz Struk <tadeusz.struk@intel.com> 1553711558 -0700
committer Sasha Levin <sashal@kernel.org> 1556069556 -0400
tpm: fix an invalid condition in tpm_common_poll
[ Upstream commit 7110629263469b4664d00b38ef80a656eddf3637 ]
The poll condition should only check response_length,
because reads should only be issued if there is data to read.
The response_read flag only prevents double writes.
The problem was that the write set the response_read to false,
enqued a tpm job, and returned. Then application called poll
which checked the response_read flag and returned EPOLLIN.
Then the application called read, but got nothing.
After all that the async_work kicked in.
Added also mutex_lock around the poll check to prevent
other possible race conditions.
Fixes: 9488585b21bef0df12 ("tpm: add support for partial reads")
Reported-by: Mantas Mikulėnas <grawity@gmail.com>
Tested-by: Mantas Mikulėnas <grawity@gmail.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Result of SHA1 : ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f
tree ada630e1da499723c827ba0ff1084f93daf9ed9c
parent b89e3859db0658df57abfb1396ebad8d1f4580bb
author Steve French <stfrench@microsoft.com> 1552856318 -0500
committer Sasha Levin <sashal@kernel.org> 1553731709 -0400
fix incorrect error code mapping for OBJECTID_NOT_FOUND
[ Upstream commit 85f9987b236cf46e06ffdb5c225cf1f3c0acb789 ]
It was mapped to EIO which can be confusing when user space
queries for an object GUID for an object for which the server
file system doesn't support (or hasn't saved one).
As Amir Goldstein suggested this is similar to ENOATTR
(equivalently ENODATA in Linux errno definitions) so
changing NT STATUS code mapping for OBJECTID_NOT_FOUND
to ENODATA.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Result of SHA1 : 51a60126aea86f259169d74fb1de5ca3d6f6481b
Segmentation fault
The segmentation fault is caused by the fopen function. Git allows a branch name to contain a forward slash ('/'), but Ubuntu treats '/' as a directory separator. For commit 51a60126aea86f259169d74fb1de5ca3d6f6481b, the branch name is for-greg/3.18-2, so the segmentation fault occurred when writing to refs/heads. I also checked that a directory called for-greg does exist under refs/heads. So I wrote a function that can fopen a file even when its directory doesn't exist yet.
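The idea of the fix, sketched in Python rather than C: create the missing parent directories before opening the ref file. In C the crash presumably comes from writing through the NULL that fopen returns when the directory is missing; that part is my assumption. The function name write_ref is made up and is not the function added to batch_fetch.c.

```python
import os


def write_ref(git_dir: str, branch: str, sha: str) -> str:
    """Write refs/heads/<branch>, creating directories for names like a/b."""
    # A branch such as "for-greg/3.18-2" becomes refs/heads/for-greg/3.18-2.
    path = os.path.join(git_dir, "refs", "heads", *branch.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as fh:
        fh.write(sha + "\n")
    return path
```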
To change batch_fetch to read compressed data, can you upload a test case that includes the file containing the compressed data and the file containing the index? Thx.
No seg fault, but still does not work:
cat tst/1556204601.idx | ./build/batch_fetch https://git.kernel.org/pub/scm/linux/kernel/git/sashal/linux-stable ls tst/1556204601.bin tst/ls.pack
....
filter_wants head=1383 local=0 id=0199619b21f7320482e8a2db14cf8bc974a7766a name=refs/tags/v4.1.48^{} /home/audris/docker/libgit2/src/fetch.c:141
git_fetch_negotiate need 96 /home/audris/docker/libgit2/src/fetch.c:177
fetch error: Object not found - no match for id (ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f)
however ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f is in tst/1556204601.bin
(btw, tst/1556204601.bin is uncompressed, should it be compressed now?)
I haven't changed batch_fetch to the compressed version yet.
I checked tst/1556204601.idx; its format is not correct: the branch name should follow the sha1. Did you change it locally? If so, can you upload it? I tested it using
ff7e9697c5c9c4d8b6521e3b6a18669fbecdba7f;queue-5.0;0;1313
51a60126aea86f259169d74fb1de5ca3d6f6481b;for-greg/3.18-2;1313;825
022ee96afba9847ce136484d3a23cf82820e09a4;for-greg/3.18-4;2138;737
f0910a100951204a48052ce62ca72915511ecc6;master;26341;1632
and it works well; it downloaded about 400MB of data.
echo tst,https://github.com/ssc-oscar/tst,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7,968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7 | /usr/bin/get_new_commits
path: tst
url: https://github.com/ssc-oscar/tst
new head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
old head: 968cdcf2e6b22fd5f8f95f2c8666f1a976fac0c7
no update!
Segmentation fault