terrier-org / terrier-core

Terrier IR Platform
http://terrier.org/

case sensitive tag issue in org.terrier.querying.Scope #2

Closed jdongca2003 closed 6 years ago

jdongca2003 commented 7 years ago

In the Terrier 4.2 branch, org.terrier.querying.Scope (line 82) reads: "String docno = m.getIndex().getMetaIndex().getItem("docno", docid);". If the tag in the document corpus is upper case ("DOCNO"), this line returns an empty value because the lookup is case sensitive. Setting "ApplicationSetup.setProperty("TrecDocTags.casesensitive", "false");" did not help.

If I change the above line to " String docno = m.getIndex().getMetaIndex().getItem("DOCNO", docid); ", everything works.
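The behaviour reported here can be reproduced with a toy example. This is only a sketch under the assumption that metadata is looked up by exact key name; ToyMetaStore is a hypothetical stand-in written for illustration and is not Terrier's MetaIndex implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-document metadata store (NOT Terrier's MetaIndex).
final class ToyMetaStore {
    // Key name -> stored value; HashMap keys are compared exactly, so case matters.
    private final Map<String, String> itemsByKey = new HashMap<>();

    void put(String key, String value) {
        itemsByKey.put(key, value);
    }

    // Exact-match lookup; an unknown key yields an empty string,
    // mirroring the "returns empty" symptom described in this report.
    String getItem(String key) {
        return itemsByKey.getOrDefault(key, "");
    }

    public static void main(String[] args) {
        ToyMetaStore store = new ToyMetaStore();
        store.put("DOCNO", "345-1"); // metadata recorded under an upper-case key

        System.out.println("DOCNO lookup: [" + store.getItem("DOCNO") + "]");
        System.out.println("docno lookup: [" + store.getItem("docno") + "]");
    }
}
```

If the store behaves like this, the key used at query time has to match the key recorded at indexing time character for character.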

My corpus documents look something like this:

<DOC>
<DOCNO>345-1</DOCNO>
an inning , or innings , is a fixed length segment of a game in any of a variety of sports most notably cricket and baseball 
during which one team attempts to score while the other team attempts to prevent the first from scoring
</DOC>

Some of my Java code:

        ApplicationSetup.setProperty("querying.postprocesses.order", "org.terrier.querying.QueryExpansion");
        ApplicationSetup.setProperty("querying.postprocesses.controls", "qe:QueryExpansion");
        ApplicationSetup.setProperty("querying.postfilters.order", "SimpleDecorate,SiteFilter,Scope");
        ApplicationSetup.setProperty("querying.postfilters.controls", "decorate:SimpleDecorate,site:SiteFilter,scope:Scope");
        ApplicationSetup.setProperty("querying.allowed.controls", "qe,qemodel,start,end,site,scope");
        ApplicationSetup.setProperty("TrecDocTags.casesensitive", "false");

        Manager queryingManager = new Manager(index);
        // Create a search request
        SearchRequest srq = queryingManager.newSearchRequestFromQuery("how many innings in overtime in baseball");
        // Specify the model to use when searching
        srq.addMatchingModel("Matching","BM25");

        // Turn on decoration for this search request
        srq.setControl("decorate", "on");
        srq.setControl("scope", "445");

cmacdonald commented 7 years ago

Document properties are usually recorded in lowercase during indexing, i.e. "docno". This means that metaindex lookups use lowercase keys; see, for example, https://github.com/terrier-org/terrier-core/blob/4.2/src/trec/org/terrier/structures/outputformat/TRECDocnoOutputFormat.java#L96

What Collection implementation did you index with? Can you tell us the value of the index.meta.key-names property in your index's data.properties file?
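One way to answer the second question is to read the property straight from the file. The following is a minimal sketch, assuming data.properties is a standard Java properties file; the class name MetaKeyNames and the fallback path var/index/data.properties are illustrative assumptions, so pass the real path as the first argument.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: prints the metadata key names recorded in an index's data.properties.
final class MetaKeyNames {
    // Splits the comma-separated value of index.meta.key-names into trimmed names.
    static List<String> keyNames(Properties indexProps) {
        List<String> names = new ArrayList<>();
        String raw = indexProps.getProperty("index.meta.key-names", "");
        for (String part : raw.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; override with the first command-line argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "var/index/data.properties");
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        System.out.println("index.meta.key-names = " + keyNames(props));
    }
}
```

The printed names show the exact case of the keys stored in the index, which is what a lookup such as getItem("docno", docid) has to match.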

jdongca2003 commented 7 years ago

Professor Craig, thanks for the quick response. I just indexed TREC_QA (http://www.aclweb.org/aclwiki/index.php?title=Question_Answering_(State_of_the_art)). I would like to compare traditional IR methods with a deep QA method. Here, QID is the question_id and AID is the response_ID.

My etc/terrier.properties looks something like this:

#default controls for query expansion
querying.postprocesses.order=QueryExpansion
querying.postprocesses.controls=qe:QueryExpansion
#default controls for the web-based interface. SimpleDecorate
#is the simplest metadata decorator. For more control, see Decorate.
querying.postfilters.order=SimpleDecorate,SiteFilter,Scope
querying.postfilters.controls=decorate:SimpleDecorate,site:SiteFilter,scope:Scope

#default and allowed controls
querying.default.controls=
querying.allowed.controls=scope,qe,qemodel,start,end,site,scope

#document tags specification
#for processing the contents of
#the documents, ignoring DOCHDR
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR
#set to true if the tags can be of various case
TrecDocTags.casesensitive=false
TrecDocTags.propertytags=DOCNO,QID,AID,LABEL
indexer.meta.forward.keys=DOCNO,QID,AID,LABEL
indexer.meta.forward.keylens=10,10,10,10
trec.querying.outputformat.docno.meta.key=DOCNO

#query tags specification
TrecQueryTags.doctag=TOP
TrecQueryTags.idtag=NUM
TrecQueryTags.process=TOP,NUM,TITLE
TrecQueryTags.skip=DESC,NARR

#stop-words file
stopwords.filename=stopword-list.txt

#the processing stages a term goes through
termpipelines=Stopwords,PorterStemmer

cmacdonald commented 7 years ago

Hi

TrecDocTags.propertytags should not contain DOCNO I think. In fact it mentions tags not present in your example document?

indexer.meta.forward.keys should be docno not DOCNO.
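The two points above could be checked mechanically before indexing. This is only a sketch of such a check, based solely on the advice in this comment; MetaKeyCheck and its messages are hypothetical and not part of Terrier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-indexing sanity check for the two properties discussed above.
final class MetaKeyCheck {
    // Splits a comma-separated property value into trimmed, non-empty entries.
    static List<String> split(String csv) {
        List<String> out = new ArrayList<>();
        for (String part : csv.split(",")) {
            String s = part.trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Returns one message per suspicious setting; an empty list means nothing was flagged.
    static List<String> findProblems(Properties p) {
        List<String> problems = new ArrayList<>();

        // Only checked when the property is set explicitly.
        String forwardKeys = p.getProperty("indexer.meta.forward.keys");
        if (forwardKeys != null && !split(forwardKeys).contains("docno")) {
            problems.add("indexer.meta.forward.keys has no lowercase docno key");
        }

        // The id tag should not also be listed as a property tag.
        String idTag = p.getProperty("TrecDocTags.idtag", "");
        for (String tag : split(p.getProperty("TrecDocTags.propertytags", ""))) {
            if (!idTag.isEmpty() && tag.equalsIgnoreCase(idTag)) {
                problems.add("TrecDocTags.propertytags contains the id tag " + tag);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("TrecDocTags.idtag", "DOCNO");
        p.setProperty("TrecDocTags.propertytags", "DOCNO,QID,AID,LABEL");
        p.setProperty("indexer.meta.forward.keys", "DOCNO,QID,AID,LABEL");
        for (String problem : findProblems(p)) {
            System.out.println(problem);
        }
    }
}
```

The main method feeds in the values from the terrier.properties posted earlier in this thread, so both checks fire for that configuration.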

I accept that some of these properties are too confusing, and we are considering ways to simplify things.

Craig

On 9 Mar 2017, at 18:47, JIANXIONG DONG <notifications@github.com> wrote:

TrecDocTags.propertytags=DOCNO,QID,AID,LABEL