osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Roubst04 (Disk4/5) manifest #28

Closed lintool closed 5 years ago

lintool commented 5 years ago

Attached is the output of $ find . -type f | sort | xargs md5sum

Please let me know if your copy is different in non-trivial ways (e.g., name casing).

disk45.md5.txt
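
If you want to check a local copy against it, one option (assuming your copy is rooted so that the relative paths in the manifest resolve) is md5sum's check mode, filtered so that only mismatches are shown:

$ cd /path/to/disks45    # wherever your copy of disks 4+5 lives
$ md5sum -c disk45.md5.txt | grep -v ': OK$'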

amallia commented 5 years ago

My copy has only 4 files: fbis.gz fr.gz ft.gz latimes.gz

As far as I know, Robust04 does not contain cr. From the TREC website:

The document collection for the Robust track is the set of documents on both TREC Disks 4 and 5 minus the Congressional Record on disk 4.


Source                 # Docs    Size (MB)
Financial Times       210,158          564
Federal Register 94    55,630          395
FBIS, disk 5          130,471          470
LA Times              131,896          475
Total Collection:     528,155        1,904

Source: https://trec.nist.gov/data/robust/04.guidelines.html
lintool commented 5 years ago

Yes, the disks had CR on them, but CR is not part of the evaluation. What I've uploaded is the manifest of the complete disks... I'm assuming systems will suppress CR themselves...

amallia commented 5 years ago

I have been thinking about this, and I believe it will simplify our work if we can assume that whatever files are contained in the Roubust04 folder are the only ones that are actually needed.

For example, if the collection name provided is Roubust04, I would expect to have a folder /input/collections/Roubust04 which contains only the .gz files needed (any number of files) and nothing related to cr.
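
To make that concrete, under this assumption the folder for my copy would look like the hypothetical listing below (the exact set of .gz files may vary, as long as nothing cr-related is among them):

$ ls /input/collections/Roubust04
fbis.gz  fr.gz  ft.gz  latimes.gz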

In the following examples, Jassv2 indexes on a file-by-file basis, while Anserini does it on a folder basis. Naturally Anserini will have a bigger index, but that is because it is indexing more than needed (not really fair, I guess...).

https://github.com/osirrc2019/jassv2-docker/blob/15d106970d88d2807621f5fec7b9d0acfcca9da2/index_robust04#L7

https://github.com/osirrc2019/anserini-docker/blob/e7ede77ffa73f5f0092e67576ec074b7f27432b7/index#L19
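
Roughly, the distinction is between invoking the indexer with an explicit file list and pointing it at the whole folder; index_program below is just a hypothetical stand-in for either system's entry point:

$ index_program /input/collections/Roubust04/fbis.gz /input/collections/Roubust04/fr.gz /input/collections/Roubust04/ft.gz /input/collections/Roubust04/latimes.gz
$ index_program /input/collections/Roubust04

In the second form, whatever happens to sit in the folder, cr included, gets indexed.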

lintool commented 5 years ago

But the potential issue is that this would make it harder to convey the contents of the directory. We can't share the files directly, but we can assume that everyone can get hold of the data from NIST...

amallia commented 5 years ago

This is fine as long as we know what the structure is... How about we add it to the Readme?

lintool commented 5 years ago

Can you take the manifest attached to this issue, find somewhere reasonable in the repo to put it, and send a PR?

amallia commented 5 years ago

I am very confused by the provided list of files. I am wondering if we can use a newer version for this workshop.

Here are a couple of odd examples:

lintool commented 5 years ago

Hrm. This is what I have in my copy (copied from the original disks 4+5)... can someone else who also has access to the original disks, e.g., @andrewtrotman, verify?

I ran uncompress and it seems to work fine...

$ uncompress -c fr941003.0z | head
<DOC>
<DOCNO> FR941003-0-00001 </DOCNO>
<PARENT> FR941003-0-00001 </PARENT>
<TEXT>

<!-- PJG FTAG 4700 -->

<!-- PJG STAG 4700 -->

<!-- PJG ITAG l=90 g=1 f=1 -->
...
arjenpdevries commented 5 years ago

Yes, the CD-ROMs had compressed files (.Z).

I can check later. I guess some people just got the collection in a different distribution format...

lintool commented 5 years ago

@arjenpdevries can you check if your copy has the weird file names?

arjenpdevries commented 5 years ago

At least it is not called roubst :-)

My copy has exactly the same list of files (or more), validated using:

ln -s TREC_VOL5 disk5
ln -s TREC_VOL_4 disk4
cut -d ' ' -f3 disk45.md5.txt | xargs ls > /dev/null 

Note that the CD-ROMs had weird, inconsistent labels (trying to prove I'm an old dog).

lintool commented 5 years ago

@amallia does this address your concerns? just plow through using deflate and you should be fine...?

arjenpdevries commented 5 years ago

PS:

[arjen@apc TREC]$ zcat ./disk4/cr/hfiles/readmeh.z
A Note to the User

The material on this disk is copyrighted and is subject to the terms and 
conditions of the TREC-96 Information-Retrieval Text Research Collection User 
Agreement, which must be signed in order to obtain a copy of the CD-ROM on 
which this data is to be found.

The changes between the original material as it came from the publisher and the 
version on this disk is detailed in the following file: readmeh.

[...]

The datasets have all been compressed using the UNIX compress utility and are 
stored in chunks of about 1 megabyte each (uncompressed size).

[..]

Special thanks should go to Dean Wilder at the Library of Congress for 
providing the data.
arjenpdevries commented 5 years ago

I do not think there is an easy rule that says "newsfile" or "readme / other" based on the filename.

amallia commented 5 years ago

I do not think there is an easy rule that says "newsfile" or "readme / other" based on the filename.

This one was my main concern, but I guess I can index everything... at least for now.

andrewtrotman commented 5 years ago

Well, yes and no.

The original filenames on the CD-ROMs are in uppercase (on my CD-ROMs). Since uncompress requires an uppercase .Z extension, I can't use uncompress on the files with the same names as in the manifest Jimmy sent. So if I copy fr941003.0z to fr941003.0z.Z and then run uncompress -c fr941003.0z.Z | head, I get the same as Jimmy.
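
Spelled out, that workaround is just:

$ cp fr941003.0z fr941003.0z.Z
$ uncompress -c fr941003.0z.Z | head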

Following on in this thread: the directories include CR, which we must exclude for robust04. They also include readme and DTD files and a load of other gunk, which we must also exclude. For ATIRE I use the following file list to index the collection without any of the other gunk:

$COLLECTION/disk4/fr94/01 $COLLECTION/disk4/fr94/02 $COLLECTION/disk4/fr94/03 $COLLECTION/disk4/fr94/04
$COLLECTION/disk4/fr94/05 $COLLECTION/disk4/fr94/06 $COLLECTION/disk4/fr94/07 $COLLECTION/disk4/fr94/08
$COLLECTION/disk4/fr94/09 $COLLECTION/disk4/fr94/10 $COLLECTION/disk4/fr94/11 $COLLECTION/disk4/fr94/12
$COLLECTION/disk4/ft/ft911 $COLLECTION/disk4/ft/ft921 $COLLECTION/disk4/ft/ft922 $COLLECTION/disk4/ft/ft923
$COLLECTION/disk4/ft/ft924 $COLLECTION/disk4/ft/ft931 $COLLECTION/disk4/ft/ft932 $COLLECTION/disk4/ft/ft933
$COLLECTION/disk4/ft/ft934 $COLLECTION/disk4/ft/ft941 $COLLECTION/disk4/ft/ft942 $COLLECTION/disk4/ft/ft943
$COLLECTION/disk4/ft/ft944 $COLLECTION/disk5/fbis/fb $COLLECTION/disk5/latimes/la

In a separate post I'll send the C++ source code that I use to uncompress the files before processing. I do it all in a single pipeline in the ATIRE indexing process: I read the .z (or .0z, .1z, etc.) file, uncompress it, break it into documents, and index each one, all in the pipeline.
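
As an aside, the rename workaround above is not needed if you do the uncompress step with gzip's zcat: it recognises compress (LZW) data by its magic bytes rather than by the suffix, so the .0z/.1z/.2z chunks can be read directly (Arjen's zcat readmeh.z example earlier in the thread does the same thing):

$ zcat fr941003.0z | head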

Andrew.

andrewtrotman commented 5 years ago

Here’s the C++ code I use to turn .Z files into text:

/* unlzw version 1.4, 22 August 2015

Copyright (C) 2014, 2015 Mark Adler

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

    Mark Adler madler@alumni.caltech.edu */

/* Version history:
   1.0  28 Sep 2014  First version
   1.1   1 Oct 2014  Cast before shift of bit buffer for portability
                     Use fastest 32-bit type for bit buffer, uint_fast32_t
                     Use uint_least16_t in case a 16-bit type is not available
   1.2   3 Oct 2014  Clean up comments, consolidate return values
   1.3  20 Aug 2015  Assure no out-of-bounds access on invalid input
   1.4  22 Aug 2015  Return uncompressed data so far on error conditions
                     Be more permissive on where the input is allowed to end
 */

#include <stdint.h>
#include <stdlib.h>

/* Type for accumulating bits.  23 bits of the register are used to
   accumulate up to 16-bit symbols. */
typedef uint_fast32_t bits_t;

/* Double size_t variable n, saturating at the maximum size_t value. */
#define DOUBLE(n) \
do { \
    size_t was = n; \
    n <<= 1; \
    if (n < was) \
        n = (size_t)0 - 1; \
} while (0)

/* Decompress compressed data generated by the Unix compress utility (LZW
   compression, files with suffix .Z).  Decompress in[0..inlen-1] to an
   allocated buffer (*out)[0..*outlen-1].  The length of the uncompressed
   data in the allocated buffer is returned in *outlen.  unlzw() returns
   zero on success, negative if the compressed data is invalid, or 1 if out
   of memory.  The negative return values are -1 for an invalid header, -2
   if the first code is not a literal or if an invalid code is detected, and
   -3 if the stream ended in the middle of a code.  -1 means that the data
   was not produced by Unix compress, -2 generally means random or corrupted
   data, and -3 generally means prematurely terminated data.  If the
   decompression results in a proper zero-length output, then unlzw()
   returns zero, *outlen is zero, and *out is NULL.  On error, any
   decompressed data up to that point is returned using *out and *outlen. */
static int unlzw(unsigned const char *in, size_t inlen,
                 unsigned char **out, size_t *outlen)
{
unsigned bits;                  /* current number of bits per code (9..16) */
unsigned mask;                  /* mask for current bits codes = (1<<bits)-1 */
bits_t buf;                     /* bit buffer -- holds up to 23 bits */
unsigned left;                  /* bits left in buf (0..7 after code pulled) */
size_t next;                    /* index of next input byte in in[] */
size_t mark;                    /* index where last change in bits began */
unsigned code;                  /* code, table traversal index */
unsigned max;                   /* maximum bits per code for this stream */
unsigned flags;                 /* compress flags, then block compress flag */
unsigned end;                   /* last valid entry in prefix/suffix tables */
unsigned prev;                  /* previous code */
unsigned final;                 /* last character written for previous code */
unsigned stack;                 /* next position for reversed string */
unsigned char *put;             /* allocated output buffer */
size_t size;                    /* size of put[] allocation */
size_t have;                    /* number of bytes of data in put[] */
int ret = 0;                    /* return code */

/* memory for unlzw() -- the first 256 entries of prefix[] and suffix[] are
   never used, so could have offset the index but it's faster to waste a
   little memory */
uint_least16_t prefix[65536];   /* index to LZW prefix string */
unsigned char suffix[65536];    /* one-character LZW suffix */
unsigned char match[65280 + 2]; /* buffer for reversed match */

/* initialize output for error returns */
*out = NULL;
*outlen = 0;

/* process the header */
if (inlen < 3 || in[0] != 0x1f || in[1] != 0x9d)
    return -1;                          /* invalid header */
flags = in[2];
if (flags & 0x60)
    return -1;                          /* invalid header */
max = flags & 0x1f;
if (max < 9 || max > 16)
    return -1;                          /* invalid header */
if (max == 9)                           /* 9 doesn't really mean 9 */
    max = 10;
flags &= 0x80;                          /* true if block compress */

/* clear table, start at nine bits per symbol */
bits = 9;
mask = 0x1ff;
end = flags ? 256 : 255;

/* set up: get the first 9-bit code, which is the first decompressed byte,
   but don't create a table entry until the next code */
if (inlen == 3)
    return 0;                           /* zero-length input is ok */
buf = in[3];
if (inlen == 4)
    return -3;                          /* a partial code is not ok */
buf += in[4] << 8;
final = prev = buf & mask;              /* code */
buf >>= bits;
left = 16 - bits;
if (prev > 255)
    return -2;                          /* first code must be a literal */

/* we have output -- allocate and set up an output buffer four times the
   size of the input (Unix compress usually compresses less than 4:1, so
   this will avoid a reallocation most of the time) */
size = inlen;
DOUBLE(size);
DOUBLE(size);
put = (unsigned char *)malloc(size);
if (put == NULL)
    return 1;
put[0] = final;                         /* first decompressed byte */
have = 1;

/* decode codes */
mark = 3;                               /* start of compressed data */
next = 5;                               /* consumed five bytes so far */
stack = 0;                              /* empty stack */
while (next < inlen) {
    /* if the table will be full after this, increment the code size */
    if (end >= mask && bits < max) {
        /* flush unused input bits and bytes to next 8*bits bit boundary
           (this is a vestigial aspect of the compressed data format
           derived from an implementation that made use of a special VAX
           machine instruction!) */
        {
            unsigned rem = (next - mark) % bits;
            if (rem) {
                rem = bits - rem;
                if (rem >= inlen - next)
                    break;
                next += rem;
            }
        }
        buf = 0;
        left = 0;

        /* mark this new location for computing the next flush */
        mark = next;

        /* increment the number of bits per symbol */
        bits++;
        mask <<= 1;
        mask++;
    }

    /* get a code of bits bits */
    buf += (bits_t)(in[next++]) << left;
    left += 8;
    if (left < bits) {
        if (next == inlen) {
            ret = -3;               /* partial code (not ok) */
            break;
        }
        buf += (bits_t)(in[next++]) << left;
        left += 8;
    }
    code = buf & mask;
    buf >>= bits;
    left -= bits;

    /* process clear code (256) */
    if (code == 256 && flags) {
        /* flush unused input bits and bytes to next 8*bits bit boundary */
        {
            unsigned rem = (next - mark) % bits;
            if (rem) {
                rem = bits - rem;
                if (rem > inlen - next)
                    break;
                next += rem;
            }
        }
        buf = 0;
        left = 0;

        /* mark this new location for computing the next flush */
        mark = next;

        /* go back to nine bits per symbol */
        bits = 9;                       /* initialize bits and mask */
        mask = 0x1ff;
        end = 255;                      /* empty table */
        continue;                       /* get next code */
    }

    /* process LZW code */
    {
        unsigned temp = code;           /* save the current code */

        /* special code to reuse last match */
        if (code > end) {
            /* Be picky on the allowed code here, and make sure that the
               code we drop through (prev) will be a valid index so that
               random input does not cause an exception. */
            if (code != end + 1 || prev > end) {
                ret = -2;               /* invalid LZW code */
                break;
            }
            match[stack++] = final;
            code = prev;
        }

        /* walk through linked list to generate output in reverse order */
        while (code >= 256) {
            match[stack++] = suffix[code];
            code = prefix[code];
        }
        match[stack++] = code;
        final = code;

        /* link new table entry */
        if (end < mask) {
            end++;
            prefix[end] = prev;
            suffix[end] = final;
        }

        /* set previous code for next iteration */
        prev = temp;
    }

    /* make room for the stack in the output */
    if (stack > size - have) {
        if (have + stack + 1 < have) {
            ret = 1;
            break;
        }
        do {
            DOUBLE(size);
        } while (stack > size - have);
        {
            unsigned char *mem = (unsigned char *)realloc(put, size);
            if (mem == NULL) {
                ret = 1;
                break;
            }
            put = mem;
        }
    }

    /* write output in forward order */
    do {
        put[have++] = match[--stack];
    } while (stack);

    /* stack is now empty (zero) for the next code */
}

/* return the decompressed data, first reducing the allocated memory */
{
    unsigned char *mem = (unsigned char *)realloc(put, have);
    if (mem != NULL)
        put = mem;
}
*out = put;
*outlen = have;
return ret;

}

int unlzw(unsigned char **out, size_t *outlen, unsigned char *str, int str_len)
{
const char *errmsg[] =
    {
    "Prematurely terminated compress stream",   /* -3 */
    "Corrupted compress stream",                /* -2 */
    "Not a Unix compress (.Z) stream",          /* -1 */
    "Unexpected return code",                   /* < -3 or > 1 */
    "Out of memory"                             /* 1 */
    };

  return unlzw(str, str_len, out, outlen);

}

ryan-clancy commented 5 years ago

Closing this, as https://github.com/osirrc/jig/pull/94 adds the directory tree and hashes for all collections.