ssc-oscar / oscar.py

Python interface for OSCAR data
GNU General Public License v3.0

The difference between .idx and sha1.tch #20

Closed by KayGau 4 years ago

KayGau commented 4 years ago

I am deploying WoC in Pengcheng, but I'm confused about the .idx files and the sha1.tch files (such as blob_0.idx and sha1.blob_0.tch). According to the paper published at MSR (World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data, Section III.D, Data Storage), I think the sha1.tch files use a git object's SHA1 as the key and the object's offset in the .idx file as the value, while the .idx file records the object's offset and size in the .bin file. But after reading the code in oscar.py, it seems that it doesn't use the .idx files at all. So what are the .idx files and sha1.tch files each for? Thank you!

audrism commented 4 years ago

Can you be more specific?

.idx files record offsets, but they are useful for sweeps, not random access.

And yes, the Python library has very limited functionality.
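To illustrate the sweep versus random access distinction, here is a minimal sketch. It assumes a toy layout in which each .idx line reads `offset;length;sha1` and the .bin file holds the raw bytes back to back; the real WoC record format may differ, and `pack`, `sweep`, and `find_by_scan` are hypothetical helpers, not part of oscar.py.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    # git hashes a blob as sha1(b"blob <size>" + NUL byte + content).
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def pack(blobs):
    """Build a toy .bin payload and .idx lines ('offset;length;sha1')."""
    chunks, idx_lines, offset = [], [], 0
    for data in blobs:
        idx_lines.append(f"{offset};{len(data)};{git_blob_sha1(data)}")
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks), idx_lines


def sweep(bin_bytes, idx_lines):
    """Visit every record in storage order: the case the .idx serves well."""
    for line in idx_lines:
        offset, length, sha = line.split(";")
        start = int(offset)
        yield sha, bin_bytes[start:start + int(length)]


def find_by_scan(bin_bytes, idx_lines, wanted_sha):
    """Random access with only the .idx means a linear scan per lookup."""
    for sha, content in sweep(bin_bytes, idx_lines):
        if sha == wanted_sha:
            return content
    return None


bin_bytes, idx_lines = pack([b"hello\n", b"world\n"])
for sha, content in sweep(bin_bytes, idx_lines):
    print(sha, content)
```

A sweep touches each record once, so the sequential index is enough. Looking up a single SHA1 this way costs a scan of the whole index, which is why a keyed database is the better fit for random access.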

KayGau commented 4 years ago

  1. I am not sure what the relationship is between the .bin, .idx, and .tch files. For example, given a blob's SHA1 value, which steps should we take, and which files should we read, to get its content?
  2. How are the .idx files and sha1.tch files organized, and how can we use them?

audrism commented 4 years ago

Have you looked at lookup/README.md? The bin/idx files are the initial storage; from them, either the offset (in All.sha1o), the content (in All.sha1c), or both are extracted.

Also, a recent commit introduced the ability to set the locations of the databases through the environment; perhaps you can create a script that generates those settings for the WoC servers.
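The bin/idx-then-extract layout mentioned in this reply can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Python dicts stand in for the key-value databases, the index tuples `(offset, length, sha1)` are a hypothetical layout, and the sketch stores raw bytes, ignoring any compression or sharding the real databases may use. The names `sha1_to_offset`, `sha1_to_content`, `blob_via_offset`, and `blob_via_content` are made up for this example.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    # git hashes a blob as sha1(b"blob <size>" + NUL byte + content).
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


# Initial storage: a toy .bin payload plus index records (hypothetical layout).
blobs = [b"print('hi')\n", b"README\n"]
bin_bytes = b"".join(blobs)
idx = []
offset = 0
for data in blobs:
    idx.append((offset, len(data), git_blob_sha1(data)))
    offset += len(data)

# Extraction step 1: key by SHA1, store where the object lives (offset map).
sha1_to_offset = {sha: (off, length) for off, length, sha in idx}

# Extraction step 2: key by SHA1, store the content itself (content map).
sha1_to_content = {sha: bin_bytes[off:off + length] for off, length, sha in idx}


def blob_via_offset(sha: str) -> bytes:
    """Look up the location by SHA1, then read that slice of the .bin data."""
    off, length = sha1_to_offset[sha]
    return bin_bytes[off:off + length]


def blob_via_content(sha: str) -> bytes:
    """Look up the content directly; the .bin data is not touched."""
    return sha1_to_content[sha]


key = git_blob_sha1(b"README\n")
print(blob_via_offset(key), blob_via_content(key))
```

Under these assumptions, the offset map keeps the database small but needs a second read from the .bin data, while the content map answers in one lookup at the cost of duplicating the content.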

KayGau commented 4 years ago

I am reading through lookup/README.md now. Maybe I missed something important; I will check it. Thank you!

audrism commented 4 years ago

This is the commit I mentioned, related to specifying WoC paths in the API: https://github.com/ssc-oscar/oscar.py/commit/7dcf54a1948b6413668bf32cbcfb1b8448d0e1de

Just keep in mind that the Python library is used only to access data, not to create any of the databases.