Closed by GoogleCodeExporter 9 years ago
Original comment by steven.b...@gmail.com
on 24 Jul 2012 at 5:35
I've started taking a first pass at coding this up, but I'm getting hung up on
how to tie the List<Feature> back to the sentence in the CAS. It seems
straightforward enough to manipulate your Set<List<Feature>> to produce either
a ranked list or to select a subset for extraction, but there doesn't really
seem to be a good mechanism for pointing an instance back to its source
sentence. Some possible, but less than perfect, solutions include:
* Making it a List<List<Feature>> and relying on the strict ordering to know
which sentences to label as extracted. This feels cumbersome.
* Somehow hashing the sentence text into a UUID and storing that as a feature
for later lookup. Or, instead of a UUID, use the document URI plus the sentence
span to uniquely identify the sentence. It seems a little ugly to store
metadata as a feature, since you really don't want to expose it to something
like an SVM. Alternatively, this could possibly be handled via a new
TransformableFeature extractor that stores the metadata during training and
returns an empty list during classification.
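To make the second bullet concrete, here is a minimal, self-contained sketch of the URI-plus-span idea: build a lookup key from the document URI and the sentence's character offsets, and keep it in a side map rather than in the feature list, so it never reaches the classifier. The class and method names are illustrative, not part of ClearTK.

```java
import java.util.HashMap;
import java.util.Map;

public class SentenceKeyDemo {

    /** Uniquely identifies a sentence by document URI plus character span. */
    static String sentenceKey(String documentUri, int begin, int end) {
        return documentUri + "#" + begin + "-" + end;
    }

    public static void main(String[] args) {
        // Side map from key back to the source sentence; the feature
        // lists themselves stay free of metadata.
        Map<String, String> keyToSentence = new HashMap<>();

        String key = sentenceKey("file:/corpus/doc1.txt", 0, 27);
        keyToSentence.put(key, "The cat sat on the mat.");

        System.out.println(key);                    // file:/corpus/doc1.txt#0-27
        System.out.println(keyToSentence.get(key)); // The cat sat on the mat.
    }
}
```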
Original comment by lee.becker
on 7 Sep 2012 at 5:24
Assuming the summarizer requires sentence segmentation, tokenization, stemming,
and pos-tags, is there any way to avoid re-running the preprocessing for the
classification stage? While it's possible to write out XMIs at the end of
training and read them back in with an XMIReader, that doesn't seem much
better than re-running everything.
What I really want is something like JCasIterable that I can iterate over
multiple times.
Original comment by lee.becker
on 7 Sep 2012 at 5:37
So you don't need to store any link back to the sentence. Your classifier can
look like:
private Set<List<Feature>> selectedSentences; // the actual model

public Boolean classify(List<Feature> sentence) {
    return selectedSentences.contains(sentence);
}
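One caveat worth making explicit: this containment check relies on List.equals being element-wise, so the lookup only succeeds if the Feature class has value-based equals and hashCode. A minimal sketch of why it works, using a stand-in Feature record rather than ClearTK's actual class:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsDemo {

    // Hypothetical stand-in for ClearTK's Feature; a record gives
    // value-based equals/hashCode automatically.
    record Feature(String name, Object value) {}

    public static void main(String[] args) {
        Set<List<Feature>> selectedSentences = new HashSet<>();
        selectedSentences.add(
            List.of(new Feature("length", 5), new Feature("pos", "NN")));

        // A structurally equal list, built independently, is still found,
        // because List.equals compares elements and Feature compares fields.
        List<Feature> probe =
            List.of(new Feature("length", 5), new Feature("pos", "NN"));
        System.out.println(selectedSentences.contains(probe)); // true
    }
}
```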
As to your second question, I don't understand what's wrong with the XMI
approach. I would just:
(1) Run all the preprocessing and save the XMIs
(2) Run the training on the XMIs
(3) Run the classification on the XMIs
JCasIterable won't really work for you because it only takes one pass through
the JCases. Of course, you could look at the source of JCasIterable (it's
pretty simple, really) and design a two-pass version for yourself. But I don't
think that would be any simpler than just doing the standard XMI approach...
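The multi-pass pattern the XMI approach enables can be sketched without any UIMA machinery: collect the serialized XMI paths once, then loop over them separately for the training and classification passes. The directory name is a placeholder, and the step of deserializing each file back into a JCas is omitted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

public class TwoPassXmi {

    /** Collects the .xmi files in a directory, in a stable order. */
    static List<Path> listXmis(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(p -> p.toString().endsWith(".xmi"))
                .sorted()
                .toList();
        }
    }

    public static void main(String[] args) throws IOException {
        Path xmiDir = Paths.get("target/xmi"); // wherever preprocessing wrote the XMIs
        if (!Files.isDirectory(xmiDir)) {
            return;
        }
        List<Path> xmiFiles = listXmis(xmiDir);

        for (Path xmi : xmiFiles) {
            // pass 1: deserialize into a JCas and extract training instances
        }
        for (Path xmi : xmiFiles) {
            // pass 2: deserialize into a JCas and classify
        }
    }
}
```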
Original comment by steven.b...@gmail.com
on 7 Sep 2012 at 8:25
An initial version of this was checked in with revision
4bb30747a5f81bababfd80107d763cba4e0592bc. It hasn't been thoroughly tested in
the wild yet, but as of Issue 370 the cleartk-summarization code all has @Beta
annotations, so I think we can safely mark this issue as resolved. If there
needs to be future development on cleartk-summarization, then we'll open up new
issues for that.
Original comment by steven.b...@gmail.com
on 19 Jul 2013 at 6:22
Original issue reported on code.google.com by
steven.b...@gmail.com
on 20 Mar 2012 at 12:42