sudeep87 / uimafit

Automatically exported from code.google.com/p/uimafit
0 stars 0 forks source link

Support for iterating over FeatureStructures #58

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I keep needing the inverse of selectCovered, that is:

    List<T> selectCovering(JCas, Class<T>, Annotation)

I also keep wanting variants of the methods that take a begin index and an end 
index instead of an Annotation (particularly useful when reading standoff 
annotations in some non-UIMA format), e.g.:

    iterate(JCas, Class<T>, int, int)
    selectCovered(JCas, Class<T>, int, int)
    selectCovering(JCas, Class<T>, int, int)

I also want support for iterating over general FeatureStructures, not just 
Annotations. Something like:

  public static <T extends FeatureStructure> Iterator<T> iterator(
      final JCas jCas,
      final Class<T> type) {
    final Type casType = JCasUtil.getType(jCas, type);
    return new Iterator<T>() {
      private FSIterator<?> iter = jCas.getIndexRepository().getAllIndexedFS(casType);
      public boolean hasNext() {
        return this.iter.hasNext();
      }
      public T next() {
        return type.cast(this.iter.next());
      }
      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  public static <T extends FeatureStructure> Iterable<T> iterate(
      final JCas jCas,
      final Class<T> type) {
    return new Iterable<T>() {
      public Iterator<T> iterator() {
        return JCasUtil.iterator(jCas, type);
      }
    };
  }

Finally, I think we should have methods for iterating over FSArrays. Something 
like:

  public static <T extends FeatureStructure> Iterator<T> iterator(
      final FSArray fsArray,
      final Class<T> type) {
    return new Iterator<T>() {
      private int index = 0;
      public boolean hasNext() {
        return this.index < fsArray.size();
      }
      public T next() {
        T next = type.cast(fsArray.get(this.index));
        this.index += 1;
        return next;
      }
      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  public static <T extends FeatureStructure> Iterable<T> iterate(
      final FSArray fsArray,
      final Class<T> type) {
    return new Iterable<T>() {
      public Iterator<T> iterator() {
        return JCasUtil.iterator(fsArray, type);
      }
    };
  }

I don't have the time to add these and tests for them right now, but I thought 
I should make an issue so that it isn't forgotten.

And of course, ideally, these should all be written as CasUtil methods and then 
mirrored in JCasUtil.

Original issue reported on code.google.com by steven.b...@gmail.com on 16 Mar 2011 at 9:43

GoogleCodeExporter commented 8 years ago
The selectCovering() is a quite slow operation, that is why it is not 
implemented in (J)CasUtil. If you need that operation, you should use the 
ContainmentIndex, which builds a internal cache for which annotation contains 
which other annotation once and a lookup then is fast.

I'll check what happens to our test-cases when I change the stuff to 
FeatureStructure.

Original comment by richard.eckart on 16 Mar 2011 at 1:30

GoogleCodeExporter commented 8 years ago
Btw. for JCasUtil I think TOP is more appropriate since FeatureStructure, as 
that is the root of the JCas class hierarchy. FeatureStructure is appropriate 
for CasUtil.

Maybe "ContainmentIndex" should be renamed to "CoverageCache" or 
"CoverageIndex" or something like that to be more obvious and might be useful 
to have a factory method in (J)CasUtil as not to miss it.

Original comment by richard.eckart on 16 Mar 2011 at 1:43

GoogleCodeExporter commented 8 years ago

Original comment by richard.eckart on 16 Mar 2011 at 1:46

GoogleCodeExporter commented 8 years ago
Yeah, I wondered about the efficiency of selectCovering. And ContainmentIndex 
looks like the solution I need - I never noticed it before. I think 
ContainmentIndex as a name is fine - but a few links from (J)CasUtil would be 
quite helpful. Perhaps the factory method ContainmentIndex.create should be 
moved or copied into (J)CasUtil as createContainmentIndex?

I don't feel strongly about TOP vs. FeatureStructure. I just used 
FeatureStructure because FSIterator<FeatureStructure> is the return type of 
getAllIndexedFS().

Original comment by steven.b...@gmail.com on 16 Mar 2011 at 1:59

GoogleCodeExporter commented 8 years ago
I have made the necessary changes on my machine (AnnotationFS to 
FeatureStructure) where possible. However, this is not completely backwards 
compatible. With uimaFIT 1.1.0 you can write

        for (AnnotationFS token : select(aCas, type)) {

with the modification you have to write

        Collection<AnnotationFS> tokens = select(aCas, type);
        for (AnnotationFS token : tokens) {

This is because Java is not smart enough to infer the return type of select() 
in a enhanced for loop (e.g. to Iterable<AnnotationFS>) and simply assumes the 
dumbest default, which is that select() returns a Iterable<FeatureStructure>, 
which is not what you would want here.

I'm not sure yet how I like that. There are several options:

1) have selectFS() return FS and select() return AnnotationFS
2) add another parameter to select() where you can specify the return type in 
cases where Java is not smart enough to infer the type
3) force users to write two lines instead of one (see above)

Briefly thinking about it, I tend towards 1).

I would also like to point out that getAllIndexedFS() is a bit magical on the 
inside regarding where it picks up the FSes and in which order they are 
returned when custom indexes are installed, while the getAnnotationIndex() 
stuff reliably returns offset-sorted annotations. This also drives me towards 
solution 1).

Original comment by richard.eckart on 16 Mar 2011 at 4:51

GoogleCodeExporter commented 8 years ago
Hmm... That's odd - the code I posted above for iterate works perfectly fine in 
a for loop for me. What does your code look like?

That said, I'm also fine with selectFS() and select().

Original comment by steven.b...@gmail.com on 16 Mar 2011 at 6:13

GoogleCodeExporter commented 8 years ago
The problem occurs only in CasUtil, not in JCasUtil. In JCasUtil, Java can 
infer the return type from the "type" parameter which is a Class. In CasUtil, 
Java cannot do that because the type parameter is a UIMA Type. So I am adding 
selectFS and select to CasUtil and in JCasUtil such a differentiation will not 
be made.

However, this addresses only the loop thing. I am still a bit wary about the 
voodoo that getAllIndexedFS() performs. At least our test cases at least are 
all fine and so are those of DKPro. However, we have no test cases that include 
any custom indexes.

I'll commit the changes nevertheless.

Original comment by richard.eckart on 16 Mar 2011 at 6:51

GoogleCodeExporter commented 8 years ago
JCasUtil.select() and iterate() methods now can return anything that inherits 
from TOP.
CasUtil now contains an additional set of methods having "FS" in their name 
(e.g. selectFS()) that can select any FeatureStructures.

Original comment by richard.eckart on 16 Mar 2011 at 6:53

GoogleCodeExporter commented 8 years ago
Cool. Thanks for the quick fixes! You addressed the selectCovering issue (and 
it's being further addressed in Issue 60) and the FeatureStructure vs. 
Annotation issues, but I don't think we can close this issue until we add the 
other two items:

* Ability to specify container in terms of begin and end offset instead of just 
as a covering Annotation
* Iteration over FSArrays

Original comment by steven.b...@gmail.com on 16 Mar 2011 at 7:39

GoogleCodeExporter commented 8 years ago
Whops, forgot those ;)

Original comment by richard.eckart on 16 Mar 2011 at 8:17

GoogleCodeExporter commented 8 years ago
Moved them into separate issues, which makes the changes more prominent in the 
CHANGES file.

Original comment by richard.eckart on 16 Mar 2011 at 8:24

GoogleCodeExporter commented 8 years ago
Great, thanks!

Original comment by steven.b...@gmail.com on 16 Mar 2011 at 8:32

GoogleCodeExporter commented 8 years ago

Original comment by richard.eckart on 7 Apr 2011 at 12:12