rug-compling / alpinocorpus

Library for handling Alpino corpora
GNU Lesser General Public License v2.1

Access xml node value #4

Closed jelmervdl closed 13 years ago

jelmervdl commented 13 years ago

For the statistics window in Dact it would be really useful if we could access the value of the matching nodes. For example, I could create a query //node[@pt="ww"]/@root and then ask the iterator for that query for both the filename of the XML file that matched and the string value of @root on the matched node.

edit: to do this correctly, it might be best to merge the search functionality, which is now divided between XPathMapper in Dact and runQuery for the dbxml corpora, into alpinocorpus. Let's support runQuery for all the corpus types.

jelmervdl commented 13 years ago

I am experimenting a bit with this (and got it working for dbxml corpora), but it gets weird when you use the same iterator for both the query values and the files, which is what the default begin() and end() do. Should we separate them?

edit: what we want:

CorpusReader::begin() -> iterate over all filenames

CorpusReader::query() -> iterate over all matched xml values (e.g. for the statistics window) /and/ iterate over all the files containing a match (e.g. for the file list)

The second use of query() could be implemented by an adaptor around the query iterator to turn it into an entry iterator.
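A rough sketch of what such an adaptor could look like, using stand-in types rather than the actual alpinocorpus classes. Match, EntryAdaptor and matchingEntries are hypothetical names made up for illustration; the only idea taken from above is that the adaptor walks the query results and reports each file once.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one query result: the entry (file) name plus
// the string value of the matched node or attribute.
struct Match {
    std::string entry;
    std::string value;
};

// Adaptor sketch: wraps a forward iterator over Match and yields each
// entry name once, skipping later matches from an entry already seen.
template <typename MatchIter>
class EntryAdaptor {
public:
    EntryAdaptor(MatchIter cur, MatchIter end) : cur_(cur), end_(end) {
        skipSeen();
    }

    // Name of the entry the adaptor currently points at.
    const std::string &operator*() const { return cur_->entry; }

    EntryAdaptor &operator++() {
        ++cur_;
        skipSeen();
        return *this;
    }

    bool atEnd() const { return cur_ == end_; }

private:
    // Advance past matches whose entry was reported before. insert()
    // records the entry and tells us whether it was new.
    void skipSeen() {
        while (cur_ != end_ && !seen_.insert(cur_->entry).second)
            ++cur_;
    }

    MatchIter cur_;
    MatchIter end_;
    std::unordered_set<std::string> seen_;
};

// Collect the distinct entry names, in order of first match.
std::vector<std::string> matchingEntries(const std::vector<Match> &matches) {
    std::vector<std::string> out;
    EntryAdaptor<std::vector<Match>::const_iterator> it(matches.begin(),
                                                        matches.end());
    for (; !it.atEnd(); ++it)
        out.push_back(*it);
    return out;
}
```

With this shape the file list would consume the adaptor, while the statistics window would read the values from the underlying query iterator directly.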

larsmans commented 13 years ago

I've considered making the value available from the iterator as well, but didn't because it indeed gets either very awkward or very slow for the "ordinary" begin() iterator, so by design you need both a CorpusReader and an iterator so you can do reader.get(*iter). (This is a deviation from C++ standard iterators.)
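To illustrate the reader-plus-iterator pattern described here, a toy sketch follows. ToyReader and totalBytes are hypothetical and only mimic the shape of the design (the iterator yields names, the reader resolves them); they are not the real CorpusReader interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical reader: iterating yields entry names only; contents are
// fetched separately through get(), so plain iteration stays cheap.
class ToyReader {
public:
    void add(const std::string &name, const std::string &xml) {
        // Remember insertion order of names; ignore duplicate names.
        if (data_.emplace(name, xml).second)
            names_.push_back(name);
    }

    std::vector<std::string>::const_iterator begin() const {
        return names_.begin();
    }
    std::vector<std::string>::const_iterator end() const {
        return names_.end();
    }

    // Resolve an entry name to its contents; throws std::out_of_range
    // for an unknown name.
    const std::string &get(const std::string &name) const {
        return data_.at(name);
    }

private:
    std::vector<std::string> names_;
    std::map<std::string, std::string> data_;
};

// Caller keeps the reader around and dereferences via reader.get(*it).
std::size_t totalBytes(const ToyReader &reader) {
    std::size_t n = 0;
    for (auto it = reader.begin(); it != reader.end(); ++it)
        n += reader.get(*it).size();
    return n;
}
```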

I think keeping a single iterator interface class is a Good Thing(TM) from a software reuse point of view.

Can you tell more specifically why you need this, instead of just keeping a reference to the reader around? I could look into the statistics window code this weekend.

jelmervdl commented 13 years ago

I want to use the results from the query directly. Now I request the file from dbxml and load it with libxml to run the same query again just to get to the matching nodes.

larsmans commented 13 years ago

Misread the question, sorry.

Added a contents method to EntryIterator in 1cf7baaeb05024acbcdf that should be able to do this. For an iterator constructed with query, it should return a QString holding specifically the matching part.
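As an illustration of how a statistics view might use per-match values, here is a minimal sketch that tallies them. It uses std::string and a plain vector of (entry, value) pairs as stand-ins for QString and the real iterator; Match and tallyValues are hypothetical names, not part of alpinocorpus or Dact.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for what a query iterator exposes per match:
// first = entry (file) name, second = matched value.
using Match = std::pair<std::string, std::string>;

// Count how often each matched value occurs across all matches.
std::map<std::string, std::size_t> tallyValues(const std::vector<Match> &matches) {
    std::map<std::string, std::size_t> counts;
    for (const auto &m : matches)
        ++counts[m.second];  // operator[] starts missing keys at 0
    return counts;
}
```

The point of the sketch is that the values come straight from the query results, so there is no need to fetch each file and re-run the query to find the matching nodes.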

danieldk commented 13 years ago

Can we close this bug?

Potential issue: what value should we return when the node has no content? Now we return a null QString.

jelmervdl commented 13 years ago

I think that is the expected behavior, but I haven't seen it happen yet because empty attributes do not occur in corpora?

The statistics window in Dact uses this functionality, and it seems to work quite nicely.

danieldk commented 13 years ago

Indeed, we currently do not have any cases where it is used.