Ability to use the (J)CasUtil select* methods on custom indexes (and beyond...)

GoogleCodeExporter commented 8 years ago

As per issue 54, uimaFIT now has support for custom indexes. However, none of 
the select*() methods in (J)CasUtil offer any way of specifying which index 
to operate on. They assume a (J)Cas and then use either the default 
AnnotationIndex or getAllIndexedFS().

Adding further methods that allow specifying an index instead of a CAS would 
cause an explosion of method variants, in particular in CasUtil, where we 
already have about four versions of every method: FS/Annotation and type/type name.

I think a good approach would be to move the implementation of all the 
(J)CasUtil methods as non-static methods into a proper class. The index to 
operate on could then be configured for an instance of the class. This would 
allow something like this:

from(cas).select(Token.class)
from("CustomIndex").select(Sentence.class)

or more explicitly:

FeatureStructureCollection<FeatureStructure> container = CasUtil.from(cas);
FeatureStructureCollection<Token> tokens = container.select(Token.class);

Driving that refactoring further, I could even imagine something like:

from(cas).select(Token.class).where(and(feature("begin", GREATER_THAN, 5), 
feature("end", LESS_THAN, 10)));

or more explicitly (again no static imports):

FeatureStructureCollection<FeatureStructure> theCas = CasUtil.from(cas);
FeatureStructureCollection<Token> tokens = theCas.select(Token.class);
FeatureStructureCollection<Token> result = tokens.where(
   Predicate.and(
      Predicate.feature("begin", GREATER_THAN, 5),
      Predicate.feature("end", LESS_THAN, 10)));

Mind that all currently available (J)CasUtil methods should still be available 
but internally be implemented by delegation to the refactored implementation 
suggested above.
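
A minimal sketch of that refactoring, using plain Java collections as a stand-in for a CAS index. `FsCollection`, `LegacyUtil`, and the `List` used as an index are illustrative assumptions, not uimaFIT or UIMA API. The only point is that the index is configured on an instance, and the existing static call shape survives by delegating to it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed class. A List plays the role of a
// CAS index; none of these names exist in uimaFIT.
class FsCollection<T> {
    private final List<T> items;

    private FsCollection(List<T> items) {
        this.items = items;
    }

    // Configure the instance with the index it should operate on.
    static FsCollection<Object> from(List<?> index) {
        return new FsCollection<Object>(new ArrayList<Object>(index));
    }

    // Keep only the elements that are instances of the requested type.
    <S> FsCollection<S> select(Class<S> type) {
        List<S> selected = new ArrayList<S>();
        for (T item : items) {
            if (type.isInstance(item)) {
                selected.add(type.cast(item));
            }
        }
        return new FsCollection<S>(selected);
    }

    // Copy out the current contents.
    List<T> asList() {
        return new ArrayList<T>(items);
    }
}

// Stand-in for an existing static utility method: same call shape as before,
// implemented by delegation to the instance-based class.
class LegacyUtil {
    static <S> List<S> select(List<?> defaultIndex, Class<S> type) {
        return FsCollection.from(defaultIndex).select(type).asList();
    }
}
```

With this shape, a custom index is just a different argument to `from(...)`, so no additional static overloads are needed.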

Original issue reported on code.google.com by richard.eckart on 18 Mar 2011 at 3:29

GoogleCodeExporter commented 8 years ago

The second line of the first example should read:

from(cas, "CustomIndex").select(Sentence.class)

Original comment by richard.eckart on 18 Mar 2011 at 1:15

GoogleCodeExporter commented 8 years ago

I like:

    from(cas, "CustomIndex").select(Sentence.class)

I'm not sold on:

    from(cas).select(Token.class).where(and(feature("begin", GREATER_THAN, 5), feature("end", LESS_THAN, 10)));

If we're going to go the route of defining predicates over collections, I'd 
prefer that we don't roll our own, and instead use an existing library, e.g. 
Guava collections, where you'd write something like:

    filter(from(cas).select(Token.class), and(feature("begin", GREATER_THAN, 5), feature("end", LESS_THAN, 10)));

So, in short, +1 for (J)CasUtil.from, -1 on our own implementation of filter 
and predicates.

Original comment by steven.b...@gmail.com on 23 Mar 2011 at 10:47

GoogleCodeExporter commented 8 years ago

Guava's filter() method returns a Collection<E>. I would like to keep the 
option open to have the "where" method return something more specialized, e.g. 
something possibly implementing FSIndex and/or being able to produce a 
FSIterator.

It would also introduce an additional dependency.

Original comment by richard.eckart on 25 Mar 2011 at 5:21

GoogleCodeExporter commented 8 years ago

I would also like to keep the option of the where() implementation being smart 
about how it handles the predicates. That is, depending on the actual instance 
on which .where() is called, the where() implementation may exploit 
instance-specific optimizations for particular predicates.
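
To illustrate the kind of optimization meant here, a rough sketch under stated assumptions: `OpenRange` and `SortedOffsets` are hypothetical names, and a sorted list of integers stands in for an ordered index of begin offsets. When `where()` recognizes a range predicate, it can use binary search instead of testing every element; any other predicate falls back to a linear scan.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical predicate type that where() can recognize: lo < value < hi.
class OpenRange implements Predicate<Integer> {
    final int lo;
    final int hi;

    OpenRange(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    public boolean test(Integer value) {
        return value > lo && value < hi;
    }
}

// Hypothetical collection of offsets kept in ascending order.
class SortedOffsets {
    private final List<Integer> sorted;

    SortedOffsets(List<Integer> offsets) {
        this.sorted = new ArrayList<>(offsets);
        Collections.sort(this.sorted);
    }

    List<Integer> where(Predicate<Integer> predicate) {
        if (predicate instanceof OpenRange) {
            // Fast path: the order is known, so find both bounds by binary search.
            OpenRange range = (OpenRange) predicate;
            int from = countBelow(range.lo, true);  // elements <= lo are excluded
            int to = countBelow(range.hi, false);   // elements >= hi are excluded
            if (from >= to) {
                return new ArrayList<>();
            }
            return new ArrayList<>(sorted.subList(from, to));
        }
        // Generic path: test every element.
        List<Integer> matches = new ArrayList<>();
        for (Integer value : sorted) {
            if (predicate.test(value)) {
                matches.add(value);
            }
        }
        return matches;
    }

    // Number of elements < key, or <= key when inclusive is true.
    private int countBelow(int key, boolean inclusive) {
        int low = 0;
        int high = sorted.size();
        while (low < high) {
            int mid = (low + high) >>> 1;
            int value = sorted.get(mid);
            if (value < key || (inclusive && value == key)) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }
}
```

Both paths return the same elements; only the amount of work differs. A method that hands back a plain Collection has no place to hang this kind of instance-specific shortcut.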

Original comment by richard.eckart on 25 Mar 2011 at 5:23

GoogleCodeExporter commented 8 years ago

I guess I'm okay with "select" returning something that has a "where" method 
(or "filter" if we want to match Guava, Scala, etc.). But I don't think we 
should create our own Predicate API.

Or maybe we should just give up on Java and switch over to Scala, where your 
".where" clause could with no extra method definitions be written as:

    .filter(t => t.getBegin > 5 && t.getEnd < 10)

;-)

Original comment by steven.b...@gmail.com on 25 Mar 2011 at 6:04

GoogleCodeExporter commented 8 years ago

Instead of using an external matching framework, the best option is probably to 
use/extend the FSConstraint framework already present in UIMA. (cf. 
CAS.createFilteredIterator()).

Original comment by richard.eckart on 26 Mar 2011 at 12:25

GoogleCodeExporter commented 8 years ago

Never noticed those before. So:

from(cas).select(Token.class).where(and(..., ...));

would be equivalent to:

cas.createFilteredIterator(
  cas.getAnnotationIndex(JCasUtil.getType(jCas, Token.class)).iterator(), 
  ConstraintFactory.and(..., ...));

and we'd provide some utility functions for creating common FSMatchConstraints? 
I'd be okay with that.
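
A rough sketch of what such utility functions might look like. To keep it runnable without UIMA on the classpath, a `Map` of feature name to int value stands in for a feature structure and `java.util.function.Predicate` stands in for FSMatchConstraint; the helper names `gt`, `lt`, and `and` are assumptions. Real helpers would build and return UIMA constraints instead.

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper factories. Predicate is only a stand-in for
// FSMatchConstraint, and a Map stands in for a feature structure.
final class Constraints {
    private Constraints() {
    }

    // Matches when the named feature is set and greater than the bound.
    static Predicate<Map<String, Integer>> gt(String feature, int bound) {
        return fs -> {
            Integer value = fs.get(feature);
            return value != null && value > bound;
        };
    }

    // Matches when the named feature is set and less than the bound.
    static Predicate<Map<String, Integer>> lt(String feature, int bound) {
        return fs -> {
            Integer value = fs.get(feature);
            return value != null && value < bound;
        };
    }

    // Matches when every given constraint matches.
    @SafeVarargs
    static <T> Predicate<T> and(Predicate<T>... parts) {
        return value -> {
            for (Predicate<T> part : parts) {
                if (!part.test(value)) {
                    return false;
                }
            }
            return true;
        };
    }
}
```

With static imports, a call site would read `and(gt("begin", 4), lt("begin", 10))`.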

Original comment by steven.b...@gmail.com on 26 Mar 2011 at 12:55

GoogleCodeExporter commented 8 years ago

Sounds really useful to me. It sounds a bit like another project I heard a 
talk about, which allowed you to do something similar.

http://uima.apache.org/downloads/sandbox/CFE_UG/CFE_UG.html

Don't feel like you need to give energy to looking into this other project (I 
didn't!). I just thought I would mention it since it seems related.

Original comment by phi...@ogren.info on 27 Mar 2011 at 2:18

GoogleCodeExporter commented 8 years ago

@comment 7 by Steven

I experimented a bit with the filtered iterator. It works nicely, but it has a 
horrible API. Originally I wanted to perform this (pseudo-SQL):

    SELECT Token FROM cas WHERE coveredText = "a" and begin > 4 and begin < 10

However, after debugging a bit, I noticed that there is currently no way to 
access the covered text in UIMA's filtered iterator framework. So I changed the 
intended query to:

    SELECT Annotation FROM cas WHERE type = Token and begin > 4 and begin < 10

In Java, I would probably want to write something like this:

    from(cas).select(Annotation.class).where(and(type(Token.class), gt("begin", 4), lt("begin", 10)))

but instead I have to write:

        // Set up example
        ConstraintFactory cf = ConstraintFactory.instance();
        FSIterator<Annotation> iterator = jcas.getAnnotationIndex().iterator();
        Type tokenType = jcas.getCasType(Token.type);

        // Restrict to Tokens
        FSTypeConstraint typeConstraint = cf.createTypeConstraint();
        typeConstraint.add(tokenType);

        // Restrict to begin > 4 && begin < 10
        FeaturePath beginFeaturePath = cas.createFeaturePath();
        beginFeaturePath.initialize("begin");
        beginFeaturePath.typeInit(tokenType);
        FSIntConstraint beginValueConstraint = cf.createIntConstraint();
        beginValueConstraint.gt(4);
        beginValueConstraint.lt(10);
        FSMatchConstraint beginFeatureConstraint = cf.embedConstraint(beginFeaturePath, beginValueConstraint);

        // Combine both constraints using "and"
        FSMatchConstraint conjunction = cf.and(typeConstraint, beginFeatureConstraint);

        FSIterator<Annotation> filteredIterator = cas.createFilteredIterator(iterator, conjunction);

@comment 8 by Philip

The FESL (Feature Extraction Specification Language) used by the CFE is 
XML-based. It does not seem like something you would want to write in your Java 
code. It looks like something that could be interesting with regard to ClearTK.

Briefly looking into the source, I find that most of the code is centered 
around parsing FESL-XML and applying it to the CAS. I did not see anything 
that would make life easier for the person coding in Java.

Original comment by whodance...@gmail.com on 3 Apr 2011 at 10:32

GoogleCodeExporter commented 8 years ago

I wasn't suggesting that we actually use the filtered iterator API directly, 
but rather that your SQLCAS object (or whatever you call it) be implemented 
internally using the filtered iterator API. Something like:

class SQLCAS {
  public static SQLCAS from(CAS cas);
  public SQLCAS select(Class<?> cls);
  public SQLCAS where(FSMatchConstraint constraint);
  public FSIterator<?> iterator();
}

And then each of those methods would be implemented using the code you wrote 
above.

Original comment by steven.b...@gmail.com on 3 Apr 2011 at 11:23

GoogleCodeExporter commented 8 years ago

Of course. I just wanted to give a full example of the horrors of that API.

Original comment by whodance...@gmail.com on 3 Apr 2011 at 11:28

GoogleCodeExporter commented 8 years ago

These issues are candidates for version 1.3.0.

Original comment by richard.eckart on 7 May 2011 at 5:31

Added labels: Milestone-1.3.0

GoogleCodeExporter commented 8 years ago

Here's an implementation of something like this that will let you write things 
like:

    DocumentAnnotation document = CasQuery.from(this.jCas).select(DocumentAnnotation.class).single();
    Iterator<Sentence> sentences = CasQuery.from(this.jCas).select(Sentence.class).iterator();
    Collection<Token> tokens = CasQuery.from(this.jCas).select(Token.class).coveredBy(sentence);
    Token token = CasQuery.from(this.jCas).select(Token.class).matching(annotation).single();
    Chunk chunk = CasQuery.from(this.jCas).select(Chunk.class).zeroOrOne();

I was just playing around with this, so do with it as you will, but it might 
make a useful starting point.
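
The attachment is not reproduced in this thread, so the following is only a guess at what the terminal operations `single()` and `zeroOrOne()` might do, inferred from their names. `Results` and its static methods are hypothetical and operate on a plain `List` rather than on a CAS.

```java
import java.util.List;

// Hypothetical terminal operations; the semantics are inferred from the
// method names and are not taken from the attached CasQuery.java.
final class Results {
    private Results() {
    }

    // Exactly one element is expected; anything else is treated as an error.
    static <T> T single(List<T> results) {
        if (results.size() != 1) {
            throw new IllegalStateException(
                    "expected exactly 1 result, found " + results.size());
        }
        return results.get(0);
    }

    // At most one element is expected; null signals that there was none.
    static <T> T zeroOrOne(List<T> results) {
        if (results.size() > 1) {
            throw new IllegalStateException(
                    "expected at most 1 result, found " + results.size());
        }
        return results.isEmpty() ? null : results.get(0);
    }
}
```

Failing loudly on an unexpected count is one possible design choice; returning the first match silently would hide indexing mistakes.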

Original comment by steven.b...@gmail.com on 10 May 2011 at 2:29

Attachments:

CasQuery.java

GoogleCodeExporter commented 8 years ago

Nice. I'll have a look at it.

Original comment by richard.eckart on 10 May 2011 at 2:45

GoogleCodeExporter commented 8 years ago

Original comment by richard.eckart on 4 Jan 2012 at 10:51

Added labels: Milestone-1.4.0

GoogleCodeExporter commented 8 years ago

Original comment by richard.eckart on 5 Jul 2012 at 4:02

Added labels: Milestone-1.5.0
Removed labels: Milestone-1.4.0

GoogleCodeExporter commented 8 years ago

Original comment by richard.eckart on 7 Jan 2013 at 4:51

Added labels: ASFJira-No

GoogleCodeExporter commented 8 years ago

Original comment by richard.eckart on 25 Aug 2013 at 8:17

Removed labels: Milestone-1.5.0

sudeep87 / uimafit

Ability to use the (J)CasUtil select* methods on custom indexes (and beyond...) #65