htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Does htrc-feature-reader fully support HTRC schema 2.0? #2

Closed: wilkens closed this issue 8 years ago

wilkens commented 8 years ago

When I run the basic FeatureReader example code over current (December 5, 2015) HTRC extracted feature data, it raises a warning:

WARNING:root:Schema version of imported (2.0) file does not match the supported version (1.0)

For the simple case that I've tried so far, just extracting vol.id, vol.year, and vol.PageCount, things seem to work fine. But can I expect problems elsewhere? I've looked for a description of the schema changes in 2.0, but can't find anything.

Thanks -- this package really does make it worlds easier to deal with the extracted features data!

wilkens commented 8 years ago

Hmm, yes, I think I've encountered trouble. With the package imported and a list of paths created, the code below chokes when iterating over pages:

feature_reader = FeatureReader(paths)
processed = 0
for vol in feature_reader:
    pagecount = 0
    print(vol.id)
    for page in vol.pages(default_section='body'):
        pagecount += 1
        print(pagecount)
    print('Pages in %s: %s' % (vol.id, pagecount))
    processed += 1
print('Total volumes processed:', processed)

The output is:

uc2.ark:/13960/t4jm2cj7x
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-24-11585004c0cd> in <module>()
      3     pagecount = 0
      4     print(vol.id)
----> 5     for page in vol.pages(default_section='body'):
      6         pagecount += 1
      7         print(pagecount)

/Users/mwilkens/anaconda/lib/python3.4/site-packages/HTRC_Feature_Reader-0.1-py3.4.egg/htrc_features/volume.py in pages(self, **kwargs)
     66     def pages(self, **kwargs):
     67         for page in self._pages:
---> 68             yield Page(page, self, **kwargs)
     69 
     70     def tokens_per_page(self, **kwargs):

/Users/mwilkens/anaconda/lib/python3.4/site-packages/HTRC_Feature_Reader-0.1-py3.4.egg/htrc_features/page.py in __init__(self, pageobj, volume, default_section)
     36         for sec in ['header', 'body', 'footer']:
     37             secjson = getattr(self, sec)
---> 38             setattr(self, sec, Section(name=sec, data=secjson, page=self))
     39 
     40     @property

/Users/mwilkens/anaconda/lib/python3.4/site-packages/HTRC_Feature_Reader-0.1-py3.4.egg/htrc_features/page.py in __init__(self, name, data, page)
     86         self.name = name
     87         self.page = page
---> 88         self.tokenlist = TokenList(data['tokens'])
     89 
     90         for (key, value) in iteritems(data):

KeyError: 'tokens'

But it's entirely possible I'm doing something dumb here ...

organisciak commented 8 years ago

Matt, I never updated this repo to work with the most recent version of the Extracted Features dataset, so it's likely a bug on the library end. I'll take a look at this bug in particular and generally resume development over the next two days.

The bug doesn't seem related to the version change, but I'll see when I dig into it. Version 2.0 added language inference and pulled some information out into a separate 'advanced' file, but nothing drastic changed: mainly, 2.0 was the release where we moved from 250k to 4.8 million volumes.
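
For context, the traceback above dies on a direct data['tokens'] lookup. A minimal sketch of the kind of guard involved (parse_section is a hypothetical stand-in here, not the library's actual fix):

# Hypothetical sketch: tolerate a section dict that omits 'tokens',
# as some 2.0-era files apparently do, instead of indexing it directly.
def parse_section(data):
    tokens = data.get('tokens', {})  # {} stands in for "no token counts"
    return tokens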

organisciak commented 8 years ago

By the way, checking versions and raising a warning: I was so responsible back then!

wilkens commented 8 years ago

I know, right? I'm impressed!

Thanks for looking into this. I'm working around it at the moment by writing my own extraction code, but your library is way better.

organisciak commented 8 years ago

Matt, I fixed this bug last week, but I've spent the past two weeks working heavily on an overhaul of this library, which has delayed merging the experimental branch into master. I'm almost done: most of the code is ready, but I need to write tests and update the documentation.

The updated code is much, much faster at parsing the dataset files. There's also a lot of refactoring around Pandas, not just internally: methods now return Pandas DataFrames. I expect this is the direction that will make the library most useful, though it won't be backwards compatible because the responses are structured differently.
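
To sketch the kind of usage this enables (hedged: volumes() and tokenlist() here reflect the overhauled API as I understand it, and paths is assumed to be a list of Extracted Features file paths):

from htrc_features import FeatureReader

paths = ['example.json.bz2']  # assumed: paths to Extracted Features files
fr = FeatureReader(paths)
for vol in fr.volumes():
    # tokenlist() is assumed to return a Pandas DataFrame of per-page,
    # per-section token counts with a 'count' column.
    tl = vol.tokenlist()
    print(vol.id, tl['count'].sum())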

I'll send a note when I push the fix alongside the updates to the master branch.

wilkens commented 8 years ago

That sounds terrific on all fronts. Thanks so much for your work on the library. I'll try the new version whenever it's pushed. But no rush, of course, on my account.

organisciak commented 8 years ago

Matt, the code changes are in master now and I'm most of the way through writing tests to limit new bugs, so I'm closing this issue. Let me know if you use the library.

I see that you and Boris are writing custom code for your ACS project with Pandas, which should work great. The main benefit you might get from this library now is the speed of preparing the DataFrames: they're built up as NumPy structured arrays and then cast to a DataFrame. There are other improvements to be made, but the NumPy code gave me approximately a 40-fold increase in performance.
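
To illustrate the structured-array technique in isolation (a generic sketch, not the library's internal code):

import numpy as np
import pandas as pd

# Accumulate typed records, build one NumPy structured array, and cast
# it to a DataFrame in a single step rather than growing the frame row
# by row; the invented records here just mimic per-page token counts.
records = [('the', 1, 102), ('whale', 1, 13), ('sea', 2, 7)]
arr = np.array(records,
               dtype=[('token', 'U32'), ('page', 'i4'), ('count', 'i4')])
df = pd.DataFrame(arr)  # column names come straight from the dtype fields
print(df)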

wilkens commented 8 years ago

Thanks, Peter - that sounds great. I appreciate both the compatibility fix and the speedup. I haven't had a chance to test yet, but I'll let you know as soon as I do. I'm sure it'll work well!
