osirrc / ciff

Common Index File Format to support interoperability between open-source IR engines
http://ciff.osirrc.io/

Create very simple export for testing purposes #12

Open lintool opened 4 years ago

lintool commented 4 years ago

@JMMackenzie and @Chriskamphuis have requested a sample export for testing purposes.

I propose exporting the index from this Anserini test case: https://github.com/castorini/anserini/blob/master/src/test/java/io/anserini/integration/TrecEndToEndTest.java

which indexes this 3 document toy collection: https://github.com/castorini/anserini/tree/master/src/test/resources/sample_docs/trec/collection2

sg?

chriskamphuis commented 4 years ago

sounds good

JMMackenzie commented 4 years ago

Perfect!

lintool commented 4 years ago

toy-complete-20200309.ciff.gz

Reading header...
=== Header === 
version: 1
num_postings_lists: 9
num_doc_records: 3
total_postings_lists: 9
total_docs: 3
total_terms_in_collection: 16
average_doclength: 5.333333
description: Export of toy 3-document collection from Anserini's io.anserini.integration.TrecEndToEndTest test case

Expecting 9 postings lists and 3 doc records in this export.
term: '01', df=1, cf=1 (0, 1)
term: '03', df=1, cf=1 (0, 1)
term: '30', df=1, cf=1 (0, 1)
term: 'content', df=1, cf=1 (0, 1)
term: 'enough', df=1, cf=1 (2, 1)
term: 'head', df=3, cf=3 (0, 1) (1, 1) (1, 1)
term: 'simpl', df=2, cf=2 (1, 1) (1, 1)
term: 'text', df=3, cf=5 (0, 1) (1, 1) (1, 3)
term: 'veri', df=1, cf=1 (1, 1)
0   WSJ_1   6
1   TREC_DOC_1  4
2   DOC222  6

lintool commented 4 years ago

TODO: encode above as a test case.
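
One way the dump above could be encoded as a test, as a minimal Python sketch. It assumes the docids in each postings list are gap-encoded (each value is the difference from the previous docid), which is consistent with the df values shown. The names `HEADER`, `POSTINGS`, `DOCS`, `decode`, and `check_export` are hypothetical and not part of any CIFF tooling, and the values are transcribed by hand from the output above rather than read from the .ciff.gz file.

```python
from itertools import accumulate

# Header fields transcribed from the dump above.
HEADER = {
    "num_postings_lists": 9,
    "num_doc_records": 3,
    "total_terms_in_collection": 16,
    "average_doclength": 5.333333,
}

# term -> (df, cf, [(docid gap, tf), ...]), transcribed from the dump above.
POSTINGS = {
    "01": (1, 1, [(0, 1)]),
    "03": (1, 1, [(0, 1)]),
    "30": (1, 1, [(0, 1)]),
    "content": (1, 1, [(0, 1)]),
    "enough": (1, 1, [(2, 1)]),
    "head": (3, 3, [(0, 1), (1, 1), (1, 1)]),
    "simpl": (2, 2, [(1, 1), (1, 1)]),
    "text": (3, 5, [(0, 1), (1, 1), (1, 3)]),
    "veri": (1, 1, [(1, 1)]),
}

# (docid, collection docid, doclength), transcribed from the dump above.
DOCS = [(0, "WSJ_1", 6), (1, "TREC_DOC_1", 4), (2, "DOC222", 6)]


def decode(postings):
    """Turn (gap, tf) pairs into (absolute docid, tf) pairs (assumes gap encoding)."""
    docids = accumulate(gap for gap, _ in postings)
    return [(docid, tf) for docid, (_, tf) in zip(docids, postings)]


def check_export(header, postings_lists, docs):
    """Assert that the header, postings lists, and doc records agree with each other."""
    assert len(postings_lists) == header["num_postings_lists"]
    assert len(docs) == header["num_doc_records"]

    tf_per_doc = {docid: 0 for docid, _, _ in docs}
    for term, (df, cf, postings) in postings_lists.items():
        decoded = decode(postings)
        assert len(decoded) == df, term
        assert sum(tf for _, tf in decoded) == cf, term
        for docid, tf in decoded:
            assert docid in tf_per_doc, term
            tf_per_doc[docid] += tf

    # In this toy collection every indexed term occurrence counts toward doclength.
    assert tf_per_doc == {docid: length for docid, _, length in docs}

    total = sum(length for _, _, length in docs)
    assert total == header["total_terms_in_collection"]
    assert abs(total / len(docs) - header["average_doclength"]) < 1e-5


check_export(HEADER, POSTINGS, DOCS)
```

A real test would presumably read these values from toy-complete-20200309.ciff.gz instead of hard-coding them; the checks themselves would stay the same.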

cmacdonald commented 4 years ago

It might be nice to have another file that demonstrates the "query terms only" case, i.e., num_postings_lists < total_postings_lists, along with the other relevant statistics.
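
A rough Python sketch of how a reader might tell the two cases apart from the header alone. The function name `is_partial_export` is hypothetical, the `partial` header is made up for illustration (no such file exists in this thread), and the assumption that the `num_*` fields never exceed their `total_*` counterparts is mine.

```python
def is_partial_export(header):
    """Return True when the file holds fewer postings lists than the full index."""
    num_lists = header["num_postings_lists"]
    total_lists = header["total_postings_lists"]
    if not 0 <= num_lists <= total_lists:
        raise ValueError("num_postings_lists must be between 0 and total_postings_lists")
    if not 0 <= header["num_doc_records"] <= header["total_docs"]:
        raise ValueError("num_doc_records must be between 0 and total_docs")
    return num_lists < total_lists


# Header values of the complete toy export shown earlier in this thread.
complete = {
    "num_postings_lists": 9,
    "total_postings_lists": 9,
    "num_doc_records": 3,
    "total_docs": 3,
}

# Hypothetical "query terms only" header: only 2 of the 9 postings lists exported.
partial = dict(complete, num_postings_lists=2)

print(is_partial_export(complete))
print(is_partial_export(partial))
```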